Scikit-learn: ValueError in distance matrix with agglomerative clustering

Created on 6 Nov 2017  ·  7 Comments  ·  Source: scikit-learn/scikit-learn

Description

AgglomerativeClustering raises a ValueError when applied to TF-IDF text data because the condensed distance matrix contains non-finite values.

Steps/Code to Reproduce

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering


def main():
    # Load all 20 newsgroups posts with headers, footers and quotes stripped.
    dataset = fetch_20newsgroups(subset='all', shuffle=True, random_state=42,
                                 remove=('headers', 'footers', 'quotes'))
    data_samples = dataset.data
    targets = dataset.target
    categories = dataset.target_names
    k = np.unique(targets).shape[0]  # one cluster per newsgroup
    # Build the TF-IDF matrix (sparse) from the raw documents.
    tf_vectorizer = TfidfVectorizer(max_features=50000, max_df=1.0, min_df=1)
    tfs = tf_vectorizer.fit_transform(data_samples)
    agg = AgglomerativeClustering(linkage="complete", n_clusters=k, affinity="cosine")
    agg.fit(tfs.toarray())  # the ValueError is raised here
    return dataset


if __name__ == '__main__':
    main()

Expected Results

No error is thrown, and the distance matrix contains only finite values.

Actual Results

File "/venv/lib/python3.5/site-packages/sklearn/cluster/hierarchical.py", line 750, in fit
    **kwargs)
  File "/venv/lib/python3.5/site-packages/sklearn/externals/joblib/memory.py", line 362, in __call__
    return self.func(*args, **kwargs)
  File "/venv/lib/python3.5/site-packages/sklearn/cluster/hierarchical.py", line 527, in _complete_linkage
    return linkage_tree(*args, **kwargs)
  File "/venv/lib/python3.5/site-packages/sklearn/cluster/hierarchical.py", line 417, in linkage_tree
    out = hierarchy.linkage(X, method=linkage, metric=affinity)
  File "/venv/lib/python3.5/site-packages/scipy/cluster/hierarchy.py", line 713, in linkage
    raise ValueError("The condensed distance matrix must contain only "
ValueError: The condensed distance matrix must contain only finite values.

Versions

>>> import platform; print(platform.platform())
Linux-4.4.0-81-generic-x86_64-with-Ubuntu-16.04-xenial
>>> import sys; print("Python", sys.version)
Python 3.5.2 (default, Nov 17 2016, 17:05:23) 
[GCC 5.4.0 20160609]
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.13.3
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 1.0.0
>>> import sklearn; print("Scikit-Learn", sklearn.__version__)
Scikit-Learn 0.19.0
>>> 

Comment
I ran the same code on a subset of the Reuters-21578 text data set and no error was thrown. I have not been able to track down what causes the non-finite values in the distance matrix.

All 7 comments

You could get NaN cosine values if a vector has no non-zero elements. Is
this possible in your case?
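
To illustrate the point above, here is a minimal sketch using a hand-rolled cosine distance in NumPy. The helper `cosine_distance` is hypothetical and is not the SciPy code path from the traceback; it only shows that the formula divides by a zero norm when one vector is all zeros, which yields NaN.

```python
import numpy as np


def cosine_distance(u, v):
    """Hand-rolled cosine distance: 1 - u.v / (|u| * |v|). Illustrative only."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    # Suppress the floating-point warning so the NaN result is easy to inspect.
    with np.errstate(invalid="ignore", divide="ignore"):
        return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


nonzero = np.array([1.0, 2.0, 0.0])
zero = np.zeros(3)

d_ok = cosine_distance(nonzero, nonzero)  # well defined
d_bad = cosine_distance(nonzero, zero)    # 0 / 0 inside the formula

print("finite:", np.isfinite(d_ok), "| zero vector gives NaN:", np.isnan(d_bad))
```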

Hi @jnothman ,
After investigating further (I should have done that first), I found that some text documents were empty, which, as you suggested, produced vectors with no non-zero elements. The cause is the remove=('headers', 'footers', 'quotes') parameter of fetch_20newsgroups, which strips the entire text of some documents.
Thanks for the tip!
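
One possible workaround, sketched here under assumptions, is to drop the all-zero rows from the TF-IDF matrix before clustering. The helper `drop_empty_rows` is hypothetical (it is not part of scikit-learn), and the small matrix stands in for the output of `TfidfVectorizer.fit_transform`.

```python
import numpy as np
from scipy import sparse


def drop_empty_rows(matrix):
    """Return (filtered_matrix, kept_indices) with all-zero rows removed.

    Hypothetical helper for illustration; not a scikit-learn function.
    """
    matrix = sparse.csr_matrix(matrix)
    # Sum of absolute values per row is zero only for all-zero rows.
    row_mass = np.asarray(abs(matrix).sum(axis=1)).ravel()
    kept = np.flatnonzero(row_mass > 0)
    return matrix[kept], kept


# Stand-in for a TF-IDF matrix; row 1 mimics an empty document.
tfs = sparse.csr_matrix(np.array([[0.5, 0.0, 0.5],
                                  [0.0, 0.0, 0.0],
                                  [0.0, 1.0, 0.0]]))
filtered, kept = drop_empty_rows(tfs)
print(filtered.shape, kept)
```

The returned `kept` indices can be used to filter the targets the same way, so labels stay aligned with the remaining documents.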

So is it fine to close this?

Yes, although I wonder whether the distance between two zero-valued vectors should simply be zero instead of non-finite. Do you think it makes sense to look into this, or is this expected behaviour?
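
The convention proposed above could be sketched as follows. This is not how scikit-learn or SciPy behave; `safe_cosine_distances` is a hypothetical helper, and setting the distance between a zero vector and a non-zero vector to 1 is an additional assumption made only for this example.

```python
import numpy as np


def safe_cosine_distances(X):
    """Pairwise cosine distances with zero vectors handled by convention.

    Assumptions (not library behaviour): zero vs. zero -> 0,
    zero vs. non-zero -> 1.
    """
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1)
    is_zero = norms == 0
    # Avoid dividing by zero; zero rows stay all-zero after scaling.
    unit = X / np.where(is_zero, 1.0, norms)[:, None]
    dist = 1.0 - unit @ unit.T            # zero rows give a dot product of 0
    dist[np.ix_(is_zero, is_zero)] = 0.0  # zero vs. zero by convention
    np.fill_diagonal(dist, 0.0)
    return np.clip(dist, 0.0, 2.0)


X = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 3.0]])
D = safe_cosine_distances(X)
print(D)
```

A matrix built this way contains only finite values, so it could in principle be passed to a clustering routine that accepts precomputed distances.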

Actually this is a duplicate of #7689, so see there...

For me, the problem was that the gram_matrix contained identical observations, which meant that the condensed distance matrix contained only zeros.

I've found that columns containing all 1's cause the same error. I located these constant columns with df.columns[df.nunique() == 1] and dropped them, which solved my problem.
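
The filter described in the comment above might look like this in pandas. The DataFrame and its column names are made up for illustration, and whether dropping constant columns resolves the error will depend on the data and the metric in use.

```python
import pandas as pd

# Made-up data: column "a" is constant (all 1's).
df = pd.DataFrame({"a": [1, 1, 1], "b": [1, 2, 3], "c": [0, 1, 0]})

# Columns with a single unique value carry no information for clustering.
constant = df.columns[df.nunique() == 1]
cleaned = df.drop(columns=constant)

print("dropped:", list(constant))
print("kept:", list(cleaned.columns))
```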
