Expected gensim.models.keyedvectors WordEmbeddingsKeyedVectors property similarity_matrix to respect the reordering of terms dictated by a supplied tfidf model. This does not seem to be the case.
This will be an unconventional bug report because I did not find this bug "in the wild". In adapting the code for another purpose, I noticed some logic that might be problematic. I will walk through the relevant code below.
```python
for row_number, w1_index in enumerate(word_indices):
    ...
    w1 = dictionary[w1_index]
    ...
    # Traverse upper triangle columns.
    if matrix_order <= nonzero_limit + 1:  # Traverse all columns.
        columns = (
            (w2_index, self.similarity(w1, dictionary[w2_index]))
            for w2_index in range(w1_index + 1, matrix_order)
            if w1_index != w2_index and dictionary[w2_index] in self.vocab)
    ...
    for w2_index, similarity in columns:
        # Ensure that we don't exceed `nonzero_limit` by mirroring the upper triangle.
        if similarity > threshold and matrix_nonzero[w2_index] <= nonzero_limit:
            element = similarity**exponent
            matrix[w1_index, w2_index] = element
            matrix_nonzero[w1_index] += 1
            matrix[w2_index, w1_index] = element
            matrix_nonzero[w2_index] += 1
```
A red flag for me was that while we consider the `word_indices` mapping for the row index, we never consider it for the column index. Even in the case of rows, notice that we fill `matrix` at the end without regard for the reordering that was supposed to have occurred.
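To make the distinction concrete, here is a toy sketch (with a hypothetical three-term dictionary) of the difference between a row number and a dictionary index under a reordering:

```python
# Hypothetical setup: dictionary order is ['abc', 'lab', 'bad'] (ids 0, 1, 2),
# while the tfidf-weighted order is ['lab', 'bad', 'abc'].
dictionary = {0: 'abc', 1: 'lab', 2: 'bad'}
word_indices = [1, 2, 0]  # dictionary id of each row, most important term first

pairs = [(row_number, w1_index)
         for row_number, w1_index in enumerate(word_indices)]
# row_number is the position in the reordered matrix; w1_index is the position
# in the original dictionary. The snippet above indexes `matrix` with
# w1_index/w2_index, so entries land in dictionary order even though the rows
# are *visited* in tfidf order.
print(pairs)  # [(0, 1), (1, 2), (2, 0)]
```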
The logic can be adapted to respect the reordering as follows:

```python
for row_number, w1_index in enumerate(word_indices):
    ...
    # Traverse upper triangle columns.
    # TODO: determine community preference for pylev.levenshtein or distance.levenshtein
    import pylev
    columns = []
    for col_number in range(row_number + 1, matrix_order):
        if row_number != col_number:
            w2_index = word_indices[col_number]
            w1 = dictionary[w1_index]
            w2 = dictionary[w2_index]
            similarity = pylev.levenshtein(w1, w2) / \
                float(max(len(w1), len(w2)))
            columns.append((col_number, similarity))
    for col_number, similarity in columns:
        element = alpha * (1 - similarity) ** beta
        matrix[row_number, col_number] = element
        matrix[col_number, row_number] = element
```
Consider the mini corpus:
```python
import gensim

mini_texts = [['abc'], ['lab', 'abc'], ['bad', 'abc']]
mini_dict = gensim.corpora.Dictionary(mini_texts)
mini_corpus = [mini_dict.doc2bow(text) for text in mini_texts]
mini_tfidf = gensim.models.TfidfModel(mini_corpus)
```
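The weights can also be worked out by hand; a minimal sketch, assuming gensim's default global weight idf(t) = log2(N / df(t)):

```python
import math

mini_texts = [['abc'], ['lab', 'abc'], ['bad', 'abc']]
num_docs = len(mini_texts)

# Document frequency: in how many documents each term appears.
df = {}
for doc in mini_texts:
    for term in set(doc):
        df[term] = df.get(term, 0) + 1

# Assumed default global weight: idf(t) = log2(N / df(t)).
idf = {term: math.log(float(num_docs) / count, 2) for term, count in df.items()}
# 'abc' appears in all three documents, so idf['abc'] == 0, while 'lab' and
# 'bad' each appear in one document, so their idf == log2(3).
```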
However, since 'abc' appears in every document, its tfidf coefficient will be zero. The ordering with respect to tfidf weight will therefore be: 'lab', 'bad', 'abc'. Because the Levenshtein similarity scores here are fairly symmetric, some reorderings would leave the matrix unchanged. This example, however, is contrived so that the tfidf reordering breaks the symmetry of the original Levenshtein similarity scores:
```
matrix([[1.        , 0.00740741, 0.00740741],
        [0.00740741, 1.        , 0.        ],
        [0.00740741, 0.        , 1.        ]], dtype=float32)
```
We see that the revised logic correctly returns the reordered scores, whereas the original logic returned the scores in dictionary order even when the tfidf model was supplied to the function. A test case for the similarity_matrix function would need to be constructed so that a reordering of terms would lead to a definitive reordering of the relevant scores in the resulting matrix.
While I have not taken the time to prove definitively that there is a bug in the original context, a unit test should be written to cover the case of a supplied tfidf reordering the terms in similarity_matrix. This is a high-risk piece of logic. If there is a bug, and a supplied tfidf isn't being incorporated properly, eventually SoftCosineSimilarity will return nonsense scores when fed documents transformed by the same tfidf model.
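A sketch of how such a test could be set up, using a standalone Levenshtein implementation in place of pylev; the `alpha` and `beta` values are assumptions chosen to be consistent with the matrix shown above, and `build` is a hypothetical helper, not a gensim API:

```python
def levenshtein(s, t):
    # Classic dynamic-programming edit distance.
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        current = [i]
        for j, ct in enumerate(t, 1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (cs != ct)))   # substitution
        previous = current
    return previous[-1]

def levsim(w1, w2, alpha=1.8, beta=5.0):
    # Normalized Levenshtein similarity, as in the adapted snippet above.
    distance = levenshtein(w1, w2) / float(max(len(w1), len(w2)))
    return alpha * (1 - distance) ** beta

def build(order):
    # Dense symmetric similarity matrix for terms taken in the given order.
    n = len(order)
    matrix = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = levsim(order[i], order[j])
    return matrix

dictionary_order = ['abc', 'lab', 'bad']
tfidf_order = ['lab', 'bad', 'abc']

# The corpus is contrived so that reordering the terms changes the matrix;
# a unit test can then assert which of the two layouts the function returns.
assert build(dictionary_order) != build(tfidf_order)
```

With these parameters, `build(tfidf_order)` reproduces the reordered matrix shown above, so an equality check against either layout would make the intended behavior unambiguous.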
similarity_matrix:
https://github.com/RaRe-Technologies/gensim/blob/f9669bb8a0b5b4b45fa8ff58d951a11d3178116d/gensim/models/keyedvectors.py#L440
existing unit tests:
https://github.com/RaRe-Technologies/gensim/blob/f9669bb8a0b5b4b45fa8ff58d951a11d3178116d/gensim/test/test_keyedvectors.py#L30
See also related: #1961
```
Darwin-15.6.0-x86_64-i386-64bit
('Python', '2.7.13 (default, Apr 4 2017, 08:46:44) \n[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)]')
('NumPy', '1.14.1')
('SciPy', '1.0.0')
('gensim', '3.4.0')
('FAST_VERSION', 0)
```
Thank you for the report. So far, this does not seem to be a bug to me. Several things to note:
- If `matrix_order <= nonzero_limit + 1`, we will process all columns. In this scenario, the order of columns does not matter – we are filling everything.
- If the matrix is larger (`matrix_order > nonzero_limit + 1`), then we go down the other branch, which only considers the columns corresponding to the terms closest to `w1` in the embedding space.

Even the documentation specifies that “[…] The rows of the term similarity matrix will be build in an increasing order of importance of terms […]”. The order in which columns are processed is not affected.
It is entirely possible I am misunderstanding the purpose of the option of providing a tfidf model to this function. Let me offer a test case (this time in context) that highlights what I believe its purpose to be.
```python
import gensim

mini_texts = [['ab'], ['abc', 'ab'], ['bcd', 'ab']]
mini_dict = gensim.corpora.Dictionary(mini_texts)
mini_corpus = [mini_dict.doc2bow(text) for text in mini_texts]
mini_tfidf = gensim.models.TfidfModel(mini_corpus)
w2v_cbow = gensim.models.Word2Vec(mini_texts, sg=1, min_count=1)

w2v_cbow.wv.similarity_matrix(mini_dict, threshold=0.0, exponent=1.0).todense()
Out[365]:
matrix([[1.       , 0.0664964, 0.       ],
        [0.0664964, 1.       , 0.       ],
        [0.       , 0.       , 1.       ]], dtype=float32)

w2v_cbow.wv.similarity_matrix(mini_dict, tfidf=mini_tfidf, threshold=0.0, exponent=1.0).todense()
Out[366]:
matrix([[1.       , 0.0664964, 0.       ],
        [0.0664964, 1.       , 0.       ],
        [0.       , 0.       , 1.       ]], dtype=float32)
```
My understanding is that a document vector, in this case, is a 3-dimensional vector whose elements are ordered according to the term order ['ab', 'abc', 'bcd']. Any document vector in the tfidf-transformed corpus would then be represented by a 3-dimensional vector of tfidf coefficients ordered according to the new tfidf-weighted term order ['abc', 'bcd', 'ab'].
With respect to the default ordering I would expect:
| | 'ab' | 'abc' | 'bcd' |
|-------|------|-------|-------|
| 'ab' | 1 | # | 0 |
| 'abc' | # | 1 | 0 |
| 'bcd' | 0 | 0 | 1 |
With respect to the tfidf ordering I would expect:
| | 'abc' | 'bcd' | 'ab' |
|-------|-------|-------|------|
| 'abc' | 1 | 0 | # |
| 'bcd' | 0 | 1 | 0 |
| 'ab' | # | 0 | 1 |
The current function returns the same matrix both times. Won't this cause term mismatches when computing soft cosine similarity with tfidf transformed documents?
Thank you for your patience. I see now I had a misconception about the gensim tfidf transformation reordering features. I am not sure where that originated, but I am very glad to have it corrected!
Just to make sure I understand, the tfidf parameter here is not meant to restructure the output in any way, it is supplied as a means to improve performance?
That is correct. When the matrix is empty, we can take a row and just fill all columns corresponding to the closest terms in the embedding space. When the matrix is already half-filled, some columns in a row (that do not necessarily correspond to the closest terms) will already be pre-filled (due to the symmetric `matrix[w2_index, w1_index] = element` assignment) and some columns corresponding to the closest terms will not be fillable (due to the `if … and matrix_nonzero[w2_index] <= nonzero_limit:` constraint).
The role of the tfidf parameter is to process the rows that correspond to important terms before the matrix has filled up; experiments suggest this improves performance compared to an arbitrary row processing order. Other algorithms for pruning the matrix (such as disregarding symmetry and just taking nonzero_limit closest terms in each row) are possible, but unimplemented. Not pruning the matrix is also an option with small datasets.
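The effect of the processing order can be illustrated with a toy model of the pruning; this is a simplified sketch, not the actual gensim implementation, and the similarity values and the `prune` helper are hypothetical:

```python
def prune(sim, order, nonzero_limit):
    # Rows are processed in `order`; each term may accumulate at most
    # `nonzero_limit` off-diagonal nonzeros, and every assignment is mirrored,
    # so rows processed early "claim" their closest columns first.
    n = len(sim)
    matrix = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    nonzero = [0] * n
    for i in order:
        # Candidate columns, closest terms first.
        for j in sorted((k for k in range(n) if k != i),
                        key=lambda k: -sim[i][k]):
            if matrix[i][j] or nonzero[i] >= nonzero_limit or nonzero[j] >= nonzero_limit:
                continue
            matrix[i][j] = matrix[j][i] = sim[i][j]
            nonzero[i] += 1
            nonzero[j] += 1
    return matrix

# Hypothetical symmetric similarities between three terms.
sim = [[1.0, 0.9, 0.8],
       [0.9, 1.0, 0.7],
       [0.8, 0.7, 1.0]]

first = prune(sim, [0, 1, 2], nonzero_limit=1)  # term 0 processed first
last = prune(sim, [2, 1, 0], nonzero_limit=1)   # term 0 processed last
```

With `nonzero_limit=1`, processing term 0 first pairs it with its closest neighbour (similarity 0.9), whereas processing it last leaves it only the 0.8 entry that row 2 mirrored in, which is why processing important rows early improves the quality of the pruned matrix.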
However, the documentation should be reworded, since the rows are processed in _decreasing_, not increasing, order of importance. I will make a commit that clarifies the documentation and closes this issue.