Faiss: PCA Matrix reduction significantly increases the size of the index and reduces the accuracy

Created on 9 Apr 2018 · 5Comments · Source: facebookresearch/faiss

Summary

[question]
Was trying to implement Index pre-transform for reducing dimensions from 512 to 128

Platform

OS: Ubuntu 16.04

Running on :

[x] CPU
[ ] GPU

Reproduction instructions

Randomly generated data of dimensions 512 , reduced to 128 using PCA matrix with IVFPQ index with polysemous training , ht=512 and nprobe = 16.

The size of the Index on disk rose ( using faiss.write_index ) while the accuracy dropped , both significantly (Size almost as big as 75% of the input data and the accuracy < 6%)

Is this an expected outcome ?

question

Source

Aashit-Sharma

All 5 comments

Hi
The drop in accuracy is expected since the dimension of uniform random data cannot be reduced in a meaningful way.
The size of the index increases by a fixed amount: the size of the PCA matrix.

mdouze on 9 Apr 2018

👍1

Hi ,
I did expect reduction in accuracy but not so significantly that it kinda doesnt make sense to incorporate it.
I will close this as I think it I need to work more on its tuning to maybe get something more than negligible accuracy on a small toy set
Thanks !

Aashit-Sharma on 11 Apr 2018

You may want to try out doing PCA yourself.
From a quick look, the current faiss implementation may have a severe numerical precision issue in PCA. I just don't know when I'll have time to do prepare a simple data set to demonstrate the problem.

kno10 on 6 Apr 2019

@kno10 curious on what dataset the numerical precision appears.
https://github.com/facebookresearch/faiss/wiki/Implementation-notes#pca-matrix-computation
Alternative PCA implementation:
https://github.com/facebook/fbpca

mdouze on 8 Apr 2019

Here are two-sample, two-dimensional test cases (designed for fp32, but you can design similar cases for fp64, too):

d0=numpy.array([[     1,     0.],[     0.,     1.]]).astype('float32')
d1=numpy.array([[  1001,  1000.],[  1000.,  1001.]]).astype('float32')
d2=numpy.array([[ 10001, 10000.],[ 10000., 10001.]]).astype('float32')
d3=numpy.array([[100001,100000.],[100000.,100001.]]).astype('float32')

for d in [d0,d1,d2,d3]:
  pca=faiss.PCAMatrix(2,2)
  pca.train(d)
  print(faiss.vector_to_array(pca.A).reshape((2,2)), "center", -faiss.vector_to_array(pca.b))

Clearly, for all four we should get a matrix with the primary direction +-sqrt(2) * [1, -1]

As expected faiss' A suffers as expected from catastrophic cancellation in the covariance matrix computation approach used by faiss. But there is also something odd going on with b (but I may just be misinterpreting b).

[[-0.70710677  0.70710677]
 [-0.70710677 -0.70710677]] center [-0.         -0.70710677]
[[-0.70710677  0.70710677]
 [-0.70710677 -0.70710677]] center [   -0.     -1414.9207]
[[0. 1.]
 [1. 0.]] center [10000.5 10000.5]
[[0. 1.]
 [1. 0.]] center [100000.5 100000.5]

for comparison, this is the result with sklearn:

for d in [d0,d1,d2,d3]:
  skpca=sklearn.decomposition.PCA()
  skpca.fit(d)
  print(skpca.components_, "center", skpca.mean_, skpca.components_.dtype)

[[ 0.7071067 -0.7071068]
 [ 0.7071068  0.7071067]] center [0.5 0.5] float32
[[ 0.7071067 -0.7071068]
 [ 0.7071068  0.7071067]] center [1000.5 1000.5] float32
[[ 0.7071067 -0.7071068]
 [ 0.7071068  0.7071067]] center [10000.5 10000.5] float32
[[ 0.7071067 -0.7071068]
 [ 0.7071068  0.7071067]] center [100000.5 100000.5] float32

The cause is the classic (e.g., in Knuth's "The Art of Computer Programming", but for variance only) catastrophic cancellation when computing the covariance using E[XY]-E[X]E[Y] when the last term is not close enough to 0. See also https://en.wikipedia.org/wiki/Covariance#Numerical_computation
We study approaches to better computing this here:
Schubert, Erich, and Michael Gertz. "Numerically stable parallel computation of (co-) variance." Proceedings of the 30th International Conference on Scientific and Statistical Database Management. ACM, 2018.
But a fairly decent approach is to first center the data, then compute the covariances. It's good in precision, and easy to vectorize. That is supposedly what sklearn uses in above example.

kno10 on 9 Apr 2019

👍1

Was this page helpful?

0 / 5 - 0 ratings