Faiss: k-means support for L_f norm, where 0<f<1

Created on 20 Feb 2019  路  9Comments  路  Source: facebookresearch/faiss

This is an enhancement request.

I know faiss does not support custom distance metrics. But it would be nice to have k-means based on L_f distance where f is positive and less than 1. In high-dimensional spaces, using these fractional distance metrics seems to provide better clustering results - see https://bib.dbvis.de/uploadedFiles/155.pdf.

enhancement

Most helpful comment

k-means cannot minimize arbitrary distances.
The _mean_ only minimizes squared Euclidean distances.
It does not minimize Euclidean distances, only the squares; and that is _not_ the same.
So the algorithm won't be able to minimize "fractional" Lp-norms either. It's _not_ as simple as assigning all points to the nearest center with some other distance, you also need to choose the optimum center for that other distance (and that won't be the arithmetic mean).

All 9 comments

duplicate of #313

Closing in favor of #313.

Closing in favor of #313.

313 has already been closed

k-means cannot minimize arbitrary distances.
The _mean_ only minimizes squared Euclidean distances.
It does not minimize Euclidean distances, only the squares; and that is _not_ the same.
So the algorithm won't be able to minimize "fractional" Lp-norms either. It's _not_ as simple as assigning all points to the nearest center with some other distance, you also need to choose the optimum center for that other distance (and that won't be the arithmetic mean).

k-means cannot minimize arbitrary distances.
The _mean_ only minimizes squared Euclidean distances.
It does not minimize Euclidean distances, only the squares; and that is _not_ the same.
So the algorithm won't be able to minimize "fractional" Lp-norms either. It's _not_ as simple as assigning all points to the nearest center with some other distance, you also need to choose the optimum center for that other distance (and that won't be the arithmetic mean).

@kno10, I was referring to the empirical results in Section 4 of the paper https://bib.dbvis.de/uploadedFiles/155.pdf. You are correct that it is not straightforward to minimize fractional norms. From the paper, "For our experiments we used one of the most widely used standard clustering algorithms - the k-means algorithm... ... The results of our experiments show that the fractional distance metrics provides a much higher classification rate ..." They don't explain what exactly they did but I get the impression they do use the arithmetic mean to calculate the centroids but use the fractional distance to extract cluster assignment in each iteration.

On a _synthetic_ data set of Gaussian clusters, where the arithmetic mean will obviously work, because the clusters are actually spherical.
Their theorem that fractional norms are better apparently only holds for uniform data, as shown here:
Francois, Damien, Vincent Wertz, and Michel Verleysen. "The concentration of fractional distances." IEEE Transactions on Knowledge and Data Engineering 19.7 (2007): 873-886.
But go ahead if you consider this important.

On a _synthetic_ data set of Gaussian clusters, where the arithmetic mean will obviously work, because the clusters are actually spherical.
Their theorem that fractional norms are better apparently only holds for uniform data, as shown here:
Francois, Damien, Vincent Wertz, and Michel Verleysen. "The concentration of fractional distances." IEEE Transactions on Knowledge and Data Engineering 19.7 (2007): 873-886.
But go ahead if you consider this important.

I understand it worked well on their synthetic data because it was Gaussian. Thanks for the reference you provided - https://perso.uclouvain.be/michel.verleysen/papers/tkde07df.pdf. I requested this enhancement assuming it would be relatively easy to implement, I could very well be wrong. It will provide one more tunable parameter that might work well with fractional distances for certain datasets.

Thanks for the discussion.
As said previously, other distance metrics are not in our priorities, but we'll keep the issue open as a reminder.

Other distances, including L_f, have been added to Faiss in 1.5.3, however they are supported only for a few index types.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

brunodoamaral picture brunodoamaral  路  3Comments

minjiaz picture minjiaz  路  3Comments

jukaradayi picture jukaradayi  路  3Comments

maozezhong picture maozezhong  路  3Comments

0DF0Arc picture 0DF0Arc  路  3Comments