[question]
Was trying to implement Index pre-transform for reducing dimensions from 512 to 128
OS: Ubuntu 16.04
Running on :
Randomly generated data of dimensions 512 , reduced to 128 using PCA matrix with IVFPQ index with polysemous training , ht=512 and nprobe = 16.
The size of the Index on disk rose ( using faiss.write_index ) while the accuracy dropped , both significantly (Size almost as big as 75% of the input data and the accuracy < 6%)
Is this an expected outcome ?
Hi
The drop in accuracy is expected since the dimension of uniform random data cannot be reduced in a meaningful way.
The size of the index increases by a fixed amount: the size of the PCA matrix.
Hi ,
I did expect reduction in accuracy but not so significantly that it kinda doesnt make sense to incorporate it.
I will close this as I think it I need to work more on its tuning to maybe get something more than negligible accuracy on a small toy set
Thanks !
You may want to try out doing PCA yourself.
From a quick look, the current faiss implementation may have a severe numerical precision issue in PCA. I just don't know when I'll have time to do prepare a simple data set to demonstrate the problem.
@kno10 curious on what dataset the numerical precision appears.
https://github.com/facebookresearch/faiss/wiki/Implementation-notes#pca-matrix-computation
Alternative PCA implementation:
https://github.com/facebook/fbpca
Here are two-sample, two-dimensional test cases (designed for fp32, but you can design similar cases for fp64, too):
d0=numpy.array([[ 1, 0.],[ 0., 1.]]).astype('float32')
d1=numpy.array([[ 1001, 1000.],[ 1000., 1001.]]).astype('float32')
d2=numpy.array([[ 10001, 10000.],[ 10000., 10001.]]).astype('float32')
d3=numpy.array([[100001,100000.],[100000.,100001.]]).astype('float32')
for d in [d0,d1,d2,d3]:
pca=faiss.PCAMatrix(2,2)
pca.train(d)
print(faiss.vector_to_array(pca.A).reshape((2,2)), "center", -faiss.vector_to_array(pca.b))
Clearly, for all four we should get a matrix with the primary direction +-sqrt(2) * [1, -1]
As expected faiss' A suffers as expected from catastrophic cancellation in the covariance matrix computation approach used by faiss. But there is also something odd going on with b (but I may just be misinterpreting b).
[[-0.70710677 0.70710677]
[-0.70710677 -0.70710677]] center [-0. -0.70710677]
[[-0.70710677 0.70710677]
[-0.70710677 -0.70710677]] center [ -0. -1414.9207]
[[0. 1.]
[1. 0.]] center [10000.5 10000.5]
[[0. 1.]
[1. 0.]] center [100000.5 100000.5]
for comparison, this is the result with sklearn:
for d in [d0,d1,d2,d3]:
skpca=sklearn.decomposition.PCA()
skpca.fit(d)
print(skpca.components_, "center", skpca.mean_, skpca.components_.dtype)
[[ 0.7071067 -0.7071068]
[ 0.7071068 0.7071067]] center [0.5 0.5] float32
[[ 0.7071067 -0.7071068]
[ 0.7071068 0.7071067]] center [1000.5 1000.5] float32
[[ 0.7071067 -0.7071068]
[ 0.7071068 0.7071067]] center [10000.5 10000.5] float32
[[ 0.7071067 -0.7071068]
[ 0.7071068 0.7071067]] center [100000.5 100000.5] float32
The cause is the classic (e.g., in Knuth's "The Art of Computer Programming", but for variance only) catastrophic cancellation when computing the covariance using E[XY]-E[X]E[Y] when the last term is not close enough to 0. See also https://en.wikipedia.org/wiki/Covariance#Numerical_computation
We study approaches to better computing this here:
Schubert, Erich, and Michael Gertz. "Numerically stable parallel computation of (co-) variance." Proceedings of the 30th International Conference on Scientific and Statistical Database Management. ACM, 2018.
But a fairly decent approach is to first center the data, then compute the covariances. It's good in precision, and easy to vectorize. That is supposedly what sklearn uses in above example.