ValueError thrown when applying AgglomerativeClustering on textual data because distance matrix contains infinite values
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

def main():
    dataset = fetch_20newsgroups(subset='all', shuffle=True, random_state=42,
                                 remove=('headers', 'footers', 'quotes'))
    data_samples = dataset.data
    targets = dataset.target
    categories = dataset.target_names
    k = np.unique(targets).shape[0]
    tf_vectorizer = TfidfVectorizer(max_features=50000, max_df=1.0, min_df=1)
    tfs = tf_vectorizer.fit_transform(data_samples)
    agg = AgglomerativeClustering(linkage="complete", n_clusters=k, affinity="cosine")
    agg.fit(tfs.toarray())
    return dataset

if __name__ == '__main__':
    main()
No error should be thrown, and the distance matrix should not contain infinite values.
File "/venv/lib/python3.5/site-packages/sklearn/cluster/hierarchical.py", line 750, in fit
**kwargs)
File "/venv/lib/python3.5/site-packages/sklearn/externals/joblib/memory.py", line 362, in __call__
return self.func(*args, **kwargs)
File "/venv/lib/python3.5/site-packages/sklearn/cluster/hierarchical.py", line 527, in _complete_linkage
return linkage_tree(*args, **kwargs)
File "/venv/lib/python3.5/site-packages/sklearn/cluster/hierarchical.py", line 417, in linkage_tree
out = hierarchy.linkage(X, method=linkage, metric=affinity)
File "/venv/lib/python3.5/site-packages/scipy/cluster/hierarchy.py", line 713, in linkage
raise ValueError("The condensed distance matrix must contain only "
ValueError: The condensed distance matrix must contain only finite values.
>>> import platform; print(platform.platform())
Linux-4.4.0-81-generic-x86_64-with-Ubuntu-16.04-xenial
>>> import sys; print("Python", sys.version)
Python 3.5.2 (default, Nov 17 2016, 17:05:23)
[GCC 5.4.0 20160609]
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.13.3
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 1.0.0
>>> import sklearn; print("Scikit-Learn", sklearn.__version__)
Scikit-Learn 0.19.0
I have used the same code on a subset of the Reuters-21578 text dataset and no error was thrown. I was not able to track down what caused the infinite values in the distance matrix.
You could get NaN cosine values if a vector has no non-zero elements. Is
this possible in your case?
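To illustrate the point above, here is a minimal sketch of the textbook cosine distance written directly in NumPy. The function `cosine_distance` is a hypothetical helper for illustration, not the routine SciPy or scikit-learn actually call. For an all-zero vector the norm is 0, so the formula evaluates 0/0 and yields NaN.

```python
import numpy as np

def cosine_distance(u, v):
    """Hypothetical helper: 1 - (u . v) / (||u|| * ||v||), no zero-vector handling."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    # Silence the floating-point warning so the NaN result is easy to inspect.
    with np.errstate(invalid="ignore", divide="ignore"):
        return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

doc = np.array([0.6, 0.8, 0.0])   # a document with some non-zero TF-IDF weights
empty_doc = np.zeros(3)           # a document whose text was stripped entirely

d_ok = cosine_distance(doc, doc)         # finite, close to 0
d_bad = cosine_distance(doc, empty_doc)  # 0/0 inside the formula
print(d_ok, d_bad, np.isfinite(d_bad))
```

A single non-finite entry of this kind in the condensed distance matrix is enough for the linkage step to raise the ValueError shown in the traceback.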
Hi @jnothman,
After some deeper investigation (I should have done that before ;)), I found that there were empty text documents, which resulted, as you suggested, in vectors with no non-zero elements. The cause is the remove=('headers', 'footers', 'quotes') parameter of the fetch_20newsgroups function, which strips the entire text of some documents.
Thanks for the tip!
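One possible workaround, sketched here under the assumption that dropping the affected documents is acceptable, is to remove all-zero rows from the feature matrix before clustering. `drop_zero_rows` is a hypothetical helper name, and the small array stands in for `tfs.toarray()` from the script above.

```python
import numpy as np

def drop_zero_rows(X):
    """Hypothetical helper: remove rows that contain only zeros.

    Returns the filtered array and the boolean mask of kept rows, so the
    same mask can be applied to labels or to the original documents.
    """
    X = np.asarray(X)
    keep = np.any(X != 0, axis=1)
    return X[keep], keep

# Toy stand-in for tfs.toarray(): row 1 plays the role of an empty document.
X = np.array([[0.0, 0.7, 0.7],
              [0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])

X_clean, keep = drop_zero_rows(X)
print(X_clean.shape, keep)
```

In the original script this would mean fitting on the filtered array and indexing `targets` with the same mask, so that labels stay aligned with the remaining documents.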
So is it fine to close this?
Yes, although I wonder whether the distance between two zero-valued vectors should simply be zero instead of non-finite. Do you think it makes sense to look into this, or is this expected behaviour?
Actually this is a duplicate of #7689, so see there...
For me, the problem was that the gram_matrix contained identical observations, which meant that the condensed distance matrix contained only zeros.
I've discovered that columns containing all 1s cause the same error. I found them with df.columns[df.nunique() == 1], dropped them, and my problem was solved.
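The column filtering described in the last comment could look roughly like this. The DataFrame contents and the names `constant_cols` and `cleaned` are made up for illustration; the comment does not say which metric or data was involved, so this only shows how the constant columns are dropped.

```python
import pandas as pd

# Toy data: column "a" holds a single repeated value.
df = pd.DataFrame({
    "a": [1, 1, 1],
    "b": [0.2, 0.5, 0.9],
    "c": [3, 1, 2],
})

# Columns with exactly one distinct value, as in the comment above.
constant_cols = df.columns[df.nunique() == 1]

# Drop them before computing distances or clustering.
cleaned = df.drop(columns=constant_cols)
print(list(constant_cols), list(cleaned.columns))
```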