The new sklearn_api.hdp (and also ldamodel) modules, in combination with matutils.Sparse2Corpus, yield wrong results when fitting models on the output of sklearn vectorizers or other sklearn-style sparse matrices: those matrices store documents in rows (CSR format), while Sparse2Corpus by default expects documents in columns. This error may occur in other sklearn_api classes, too.
I believe that either the default value of the documents_columns parameter in the Sparse2Corpus constructor should be changed:
class Sparse2Corpus(object):
    """
    Convert a matrix in scipy.sparse format into a streaming gensim corpus.
    This is the mirror function to `corpus2csc`.
    """
    # Change this to def __init__(self, sparse, documents_columns=False)?
    def __init__(self, sparse, documents_columns=True):
        if documents_columns:
            self.sparse = sparse.tocsc()
        else:
            self.sparse = sparse.tocsr().T  # make sure shape[1] = number of docs (needed in len())
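To see the effect, here is a minimal scipy-only sketch that mirrors the two constructor branches above (the toy matrix is made up for illustration; no gensim required):

```python
import scipy.sparse as sp

# A toy "vectorizer output": 3 documents x 5 features, documents in rows (CSR),
# which is what sklearn's CountVectorizer/TfidfVectorizer produce
x = sp.csr_matrix([[1, 0, 2, 0, 0],
                   [0, 1, 0, 0, 3],
                   [4, 0, 0, 1, 0]])

# documents_columns=True branch: the matrix is taken as-is, so the corpus
# length (shape[1]) ends up being the number of *features*
as_columns = x.tocsc()
# documents_columns=False branch: the matrix is transposed, so shape[1]
# is the actual number of documents
as_rows = x.tocsr().T

print(as_columns.shape[1])  # 5 -- features mistaken for documents
print(as_rows.shape[1])     # 3 -- the real document count
```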
or the following call should pass documents_columns=False explicitly:
# The same happens in the LDA model wrapper
class HdpTransformer(TransformerMixin, BaseEstimator):
    ...
    def fit(self, X, y=None):
        """
        Fit the model according to the given training data.
        Calls gensim.models.HdpModel.
        """
        if sparse.issparse(X):
            corpus = matutils.Sparse2Corpus(X)  # Change to matutils.Sparse2Corpus(X, documents_columns=False)?
        ...
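To illustrate why the orientation matters for per-document iteration, here is a small scipy-only sketch of how one document column becomes the (feature_id, count) pairs a gensim corpus yields. The toy matrix and the conversion code are illustrative, not gensim's actual implementation:

```python
import scipy.sparse as sp

# 2 documents x 4 features, documents in rows (sklearn convention)
x = sp.csr_matrix([[1, 0, 2, 0],
                   [0, 3, 0, 1]])

# After transposing, documents sit in columns, so each column can be read
# off as one document's bag-of-words
xt = x.T.tocsc()
doc0 = xt.getcol(0)  # first document as a sparse column
bow0 = [(int(i), int(v)) for i, v in zip(doc0.indices, doc0.data)]
print(bow0)  # [(0, 1), (2, 2)] -- feature 0 appears once, feature 2 twice
```

Without the transpose (or documents_columns=False), each "document" handed to the model would instead be a feature column spanning all documents.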
Example: the number of processed documents ends up equal to the number of features.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from gensim.sklearn_api.hdp import HdpTransformer
# Small dataset configuration
ngs = fetch_20newsgroups()
samples = ngs.data[:100]
# Simple count vectorization
vectorizer = CountVectorizer()
# x is a sparse matrix
x = vectorizer.fit_transform(samples)
print("%d documents, %d features" % x.shape) # 100 documents, 6547 features
inv_vocab = {v: k for k, v in vectorizer.vocabulary_.items()}
# Train an HDP
hdp_transformer = HdpTransformer(inv_vocab)
hdp_transformer.fit(x)
# Should be 100 but got 6547
print("Processed documents %d" % hdp_transformer.gensim_model.m_num_docs_processed) # 6547
We would expect the gensim HDP model to have processed the 100 documents in samples, yet it reports 6547, the number of features.
As a workaround, the sparse matrix x can be transposed before fitting:
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(samples)
inv_vocab = {v: k for k, v in vectorizer.vocabulary_.items()}
hdp_transformer = HdpTransformer(inv_vocab)
hdp_transformer.fit(x.T)
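The two fixes are equivalent: transposing x up front leaves the documents in columns, which is exactly what the documents_columns=False branch produces internally. A small scipy-only check on a toy matrix (no gensim needed):

```python
import numpy as np
import scipy.sparse as sp

x = sp.csr_matrix(np.arange(12).reshape(3, 4))  # 3 docs x 4 features

# What Sparse2Corpus(x.T) stores internally (documents_columns=True default)
fix_a = x.T.tocsc()
# What Sparse2Corpus(x, documents_columns=False) stores internally
fix_b = x.tocsr().T

print((fix_a != fix_b).nnz)  # 0 -- identical matrices
print(fix_b.shape)           # (4, 3) -- shape[1] is the document count
```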
Windows-8.1-6.3.9600
('Python', '2.7.13 |Anaconda custom (64-bit)| (default, Dec 19 2016, 13:29:36) [MSC v.1500 64 bit (AMD64)]')
('NumPy', '1.12.1')
('SciPy', '1.0.0')
('gensim', '3.0.1')
('FAST_VERSION', 0)
Thanks for the detailed report @mmunozm. Adding documents_columns explicitly in the wrappers is a better idea, I think, than changing the default behavior.
@chinmayapancholi13 do you agree? Can you fix this?
@menshikh-iv I agree that it would be better to change the call being made from the sklearn_wrapper code than to change the default behaviour, as doing that might affect other parts of the existing code as well.
Yes, I can get this done (along with the issue in fasttext's load method) within this week. My schedule has been a little swamped recently so sorry again for the delay and thanks a lot for your patience! :)