The new sklearn_api.hdp (and also ldamodel) modules, in combination with matutils.Sparse2Corpus, yield wrong results when fitting models on the output of sklearn vectorizers or other sklearn-style sparse matrices: those matrices store documents in rows (CSR format), while Sparse2Corpus by default expects documents in columns. This error may occur in other sklearn_api classes, too.
I believe that either the default value of the documents_columns parameter in the Sparse2Corpus constructor should be changed:
class Sparse2Corpus(object):
    """
    Convert a matrix in scipy.sparse format into a streaming gensim corpus.
    This is the mirror function to `corpus2csc`.
    """
    # Change this to def __init__(self, sparse, documents_columns=False)?
    def __init__(self, sparse, documents_columns=True):
        if documents_columns:
            self.sparse = sparse.tocsc()
        else:
            self.sparse = sparse.tocsr().T  # make sure shape[1] = number of docs (needed in len())
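To see the effect, here is a minimal scipy-only sketch that mirrors the two constructor branches above (the toy matrix is made up for illustration; no gensim required):

```python
import scipy.sparse as sp

# A toy "vectorizer output": 3 documents x 5 features, documents in rows (CSR),
# which is what sklearn's CountVectorizer/TfidfVectorizer produce
x = sp.csr_matrix([[1, 0, 2, 0, 0],
                   [0, 1, 0, 0, 3],
                   [4, 0, 0, 1, 0]])

# documents_columns=True branch: the matrix is taken as-is, so the corpus
# length (shape[1]) ends up being the number of *features*
as_columns = x.tocsc()
# documents_columns=False branch: the matrix is transposed, so shape[1]
# is the actual number of documents
as_rows = x.tocsr().T

print(as_columns.shape[1])  # 5 -- features mistaken for documents
print(as_rows.shape[1])     # 3 -- the real document count
```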
or the following call should pass documents_columns=False explicitly:
# The same happens in the LDA model wrapper
class HdpTransformer(TransformerMixin, BaseEstimator):
    ...
    def fit(self, X, y=None):
        """
        Fit the model according to the given training data.
        Calls gensim.models.HdpModel.
        """
        if sparse.issparse(X):
            corpus = matutils.Sparse2Corpus(X)  # Change to matutils.Sparse2Corpus(X, documents_columns=False)?
        ...
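To illustrate why the orientation matters for per-document iteration, here is a small scipy-only sketch of how one document column becomes the (feature_id, count) pairs a gensim corpus yields. The toy matrix and the conversion code are illustrative, not gensim's actual implementation:

```python
import scipy.sparse as sp

# 2 documents x 4 features, documents in rows (sklearn convention)
x = sp.csr_matrix([[1, 0, 2, 0],
                   [0, 3, 0, 1]])

# After transposing, documents sit in columns, so each column can be read
# off as one document's bag-of-words
xt = x.T.tocsc()
doc0 = xt.getcol(0)  # first document as a sparse column
bow0 = [(int(i), int(v)) for i, v in zip(doc0.indices, doc0.data)]
print(bow0)  # [(0, 1), (2, 2)] -- feature 0 appears once, feature 2 twice
```

Without the transpose (or documents_columns=False), each "document" handed to the model would instead be a feature column spanning all documents.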
Example: the number of processed documents ends up equal to the number of features.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from gensim.sklearn_api.hdp import HdpTransformer
# Small dataset configuration
ngs = fetch_20newsgroups()
samples = ngs.data[:100]
# Simple count vectorization
vectorizer = CountVectorizer()
# x is a sparse matrix
x = vectorizer.fit_transform(samples)
print("%d documents, %d features" % x.shape) # 100 documents, 6547 features
inv_vocab = {v: k for k, v in vectorizer.vocabulary_.items()}
# Train an HDP
hdp_transformer = HdpTransformer(inv_vocab)
hdp_transformer.fit(x)
# Should be 100 but got 6547
print("Processed documents %d" % hdp_transformer.gensim_model.m_num_docs_processed) # 6547
We would expect the gensim HDP model to have processed the 100 documents in samples, yet it reports 6547, the number of features.
As a workaround, the sparse matrix x can be transposed before fitting:
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(samples)
inv_vocab = {v: k for k, v in vectorizer.vocabulary_.items()}
hdp_transformer = HdpTransformer(inv_vocab)
hdp_transformer.fit(x.T)
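The two fixes are equivalent: transposing x up front leaves the documents in columns, which is exactly what the documents_columns=False branch produces internally. A small scipy-only check on a toy matrix (no gensim needed):

```python
import numpy as np
import scipy.sparse as sp

x = sp.csr_matrix(np.arange(12).reshape(3, 4))  # 3 docs x 4 features

# What Sparse2Corpus(x.T) stores internally (documents_columns=True default)
fix_a = x.T.tocsc()
# What Sparse2Corpus(x, documents_columns=False) stores internally
fix_b = x.tocsr().T

print((fix_a != fix_b).nnz)  # 0 -- identical matrices
print(fix_b.shape)           # (4, 3) -- shape[1] is the document count
```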
Windows-8.1-6.3.9600
('Python', '2.7.13 |Anaconda custom (64-bit)| (default, Dec 19 2016, 13:29:36) [MSC v.1500 64 bit (AMD64)]')
('NumPy', '1.12.1')
('SciPy', '1.0.0')
('gensim', '3.0.1')
('FAST_VERSION', 0)
Thanks for the detailed report @mmunozm. Adding documents_columns explicitly in the wrappers is a better idea, I think, than changing the default behavior.
@chinmayapancholi13 do you agree? Can you fix this?
@menshikh-iv I agree that it would be better to change the call being made from the sklearn_wrapper code than to change the default behaviour, as doing that might affect other parts of the existing code as well.
Yes, I can get this done (along with the issue in fasttext's load method) within this week. My schedule has been a little swamped recently so sorry again for the delay and thanks a lot for your patience! :)