Here are two-sample, two-dimensional test cases (designed for fp32, but you can design similar cases for fp64, too):
d0=numpy.array([[ 1, 0.],[ 0., 1.]]).astype('float32')
d1=numpy.array([[ 1001, 1000.],[ 1000., 1001.]]).astype('float32')
d2=numpy.array([[ 10001, 10000.],[ 10000., 10001.]]).astype('float32')
d3=numpy.array([[100001,100000.],[100000.,100001.]]).astype('float32')
for d in [d0,d1,d2,d3]:
pca=faiss.PCAMatrix(2,2)
pca.train(d)
print(faiss.vector_to_array(pca.A).reshape((2,2)), "center", -faiss.vector_to_array(pca.b))
Clearly, for all four we should get a matrix with the primary direction +-sqrt(2) * [1, -1]
As expected faiss' A suffers as expected from catastrophic cancellation in the covariance matrix computation approach used by faiss. But there is also something odd going on with b (but I may just be misinterpreting b).
[[-0.70710677 0.70710677]
[-0.70710677 -0.70710677]] center [-0. -0.70710677]
[[-0.70710677 0.70710677]
[-0.70710677 -0.70710677]] center [ -0. -1414.9207]
[[0. 1.]
[1. 0.]] center [10000.5 10000.5]
[[0. 1.]
[1. 0.]] center [100000.5 100000.5]
for comparison, this is the result with sklearn:
for d in [d0,d1,d2,d3]:
skpca=sklearn.decomposition.PCA()
skpca.fit(d)
print(skpca.components_, "center", skpca.mean_, skpca.components_.dtype)
[[ 0.7071067 -0.7071068]
[ 0.7071068 0.7071067]] center [0.5 0.5] float32
[[ 0.7071067 -0.7071068]
[ 0.7071068 0.7071067]] center [1000.5 1000.5] float32
[[ 0.7071067 -0.7071068]
[ 0.7071068 0.7071067]] center [10000.5 10000.5] float32
[[ 0.7071067 -0.7071068]
[ 0.7071068 0.7071067]] center [100000.5 100000.5] float32
The cause is the classic (e.g., in Knuth's "The Art of Computer Programming", but for variance only) catastrophic cancellation when computing the covariance using E[XY]-E[X]E[Y] when the last term is not close enough to 0. See also https://en.wikipedia.org/wiki/Covariance#Numerical_computation
We study approaches to better computing this here:
Schubert, Erich, and Michael Gertz. "Numerically stable parallel computation of (co-) variance." Proceedings of the 30th International Conference on Scientific and Statistical Database Management. ACM, 2018.
But a fairly decent approach is to first center the data, then compute the covariances. It's good in precision, and easy to vectorize. That is supposedly what sklearn uses in above example.
_Originally posted by @kno10 in https://github.com/facebookresearch/faiss/issues/400#issuecomment-481033695_
Thanks for the expert analysis @kno10 !
The reason why I used E[XY]-E[X]E[Y] is to avoid duplicating the data in RAM (and of course computing it in double precision).
But of course a solution with a better numerical behavior would be better.
In the mentioned paper we discuss various numerically stable approaches, but I am not aware of an easy way to do these with BLAS. Single precision should usually be okay if your input data is single precision only. Vectorization with AVX etc. is possible, but I only have implemented this for variance, not covariance; the savings with covariance likely are much less because we're interested in d x d combinations and only have few registers. Multithreading is straightforward, though; split the data, compute a covariance matrix for each partition with any stable incremental approach, and combine with the weighted merge equations.
Can you confirm b is indeed the negative mean implemented in sklearn please?
@don-tpanic it clearly isn't.
It's probably b as in y = A * x + b, not as in y = A * (x - mu), so it's probably b = A^T * -mu
for d in [d0,d1,d2,d3]:
pca=faiss.PCAMatrix(2,2)
pca.train(d)
A = faiss.vector_to_array(pca.A).reshape((2,2))
mu = -A.T.dot(faiss.vector_to_array(pca.b))
print(A, "center", mu)
[[-0.70710677 0.70710677]
[-0.70710677 -0.70710677]] center [0.49999997 0.49999997]
[[-0.70710677 0.70710677]
[-0.70710677 -0.70710677]] center [1000.5 1000.5]
[[0. 1.]
[1. 0.]] center [10000.5 10000.5]
[[0. 1.]
[1. 0.]] center [100000.5 100000.5]
As you can see, the center is okay, the catastrophic cancellation only happens in the covariance matrix.
Most helpful comment
@don-tpanic it clearly isn't.
It's probably
bas iny = A * x + b, not as iny = A * (x - mu), so it's probablyb = A^T * -muAs you can see, the center is okay, the catastrophic cancellation only happens in the covariance matrix.