Faiss: Vector normalization

Created on 19 Mar 2017 · 3Comments · Source: facebookresearch/faiss

Hello,

Are the input vectors expected to be approximately of the same norm? I didn't see such requirements in the README but in practice I have had trouble if the norms of the vector vary too much.

Thanks in advance,

question

Source

daniellevy

Most helpful comment

Hi,

This is a good question.

I assume you mean you have varying norms within a set of vectors that are encoded with a ProductQuantizer in IndexPQ or in an IndexIVF.

The objective of the quantizer is to minimize the squared reconstruction error, ie.

E = min sum_{x\in X} || x - r(c(x)) ||^2

where x is the set of vectors to encode, c() and r() are the encoding and decoding functions respectively, assuming X is both the training set and the set to be encoded.

This optimization objective shows that when there are vectors with very small norms, the encoder will "find it easy" to just encode them as a 0 vector.

For example, assume X = {m*e_1, ..., m* e_10, M * e_1, ..., M*e_10 }, with scalars m << M and e_1, e_10 ten unit vectors. Assume the codec has 11 codes available. Then it will just assign codeword 0 for m*e_1, ...., m*e_10 and allocate one codeword for each of M*e_1, ..., M*e_10, because this will minimize E=10 * m^2

The effects are:

on PQ-encoded vectors, the relative reconstruction erroras measured with ||x - r(c(x))||^2 / ||x||^2, is bad
on in inverted indexes (IndexIVF classes, "IMI" or "IVF" factory strings) all vectors with small norm are assigned to the same inverted list. This makes the data structure very unbalanced, which slows down searches.

To remedy this, please look at the application context. For example, it may be useful to normalize the vectors before compression and store the norms separately. This has been used in the work https://arxiv.org/pdf/1612.03651.pdf

mdouze on 20 Mar 2017

👍2

All 3 comments

Hi,