Faiss: Vector normalization

Created on 19 Mar 2017  路  3Comments  路  Source: facebookresearch/faiss

Hello,

Are the input vectors expected to be approximately of the same norm? I didn't see such requirements in the README but in practice I have had trouble if the norms of the vector vary too much.

Thanks in advance,

question

Most helpful comment

Hi,

This is a good question.

I assume you mean you have varying norms within a set of vectors that are encoded with a ProductQuantizer in IndexPQ or in an IndexIVF.

The objective of the quantizer is to minimize the squared reconstruction error, ie.

E = min sum_{x\in X} || x - r(c(x)) ||^2

where x is the set of vectors to encode, c() and r() are the encoding and decoding functions respectively, assuming X is both the training set and the set to be encoded.

This optimization objective shows that when there are vectors with very small norms, the encoder will "find it easy" to just encode them as a 0 vector.

For example, assume X = {m*e_1, ..., m* e_10, M * e_1, ..., M*e_10 }, with scalars m << M and e_1, e_10 ten unit vectors. Assume the codec has 11 codes available. Then it will just assign codeword 0 for m*e_1, ...., m*e_10 and allocate one codeword for each of M*e_1, ..., M*e_10, because this will minimize E=10 * m^2

The effects are:

  • on PQ-encoded vectors, the relative reconstruction erroras measured with ||x - r(c(x))||^2 / ||x||^2, is bad

  • on in inverted indexes (IndexIVF classes, "IMI" or "IVF" factory strings) all vectors with small norm are assigned to the same inverted list. This makes the data structure very unbalanced, which slows down searches.

To remedy this, please look at the application context. For example, it may be useful to normalize the vectors before compression and store the norms separately. This has been used in the work https://arxiv.org/pdf/1612.03651.pdf

All 3 comments

Hi,

This is a good question.

I assume you mean you have varying norms within a set of vectors that are encoded with a ProductQuantizer in IndexPQ or in an IndexIVF.

The objective of the quantizer is to minimize the squared reconstruction error, ie.

E = min sum_{x\in X} || x - r(c(x)) ||^2

where x is the set of vectors to encode, c() and r() are the encoding and decoding functions respectively, assuming X is both the training set and the set to be encoded.

This optimization objective shows that when there are vectors with very small norms, the encoder will "find it easy" to just encode them as a 0 vector.

For example, assume X = {m*e_1, ..., m* e_10, M * e_1, ..., M*e_10 }, with scalars m << M and e_1, e_10 ten unit vectors. Assume the codec has 11 codes available. Then it will just assign codeword 0 for m*e_1, ...., m*e_10 and allocate one codeword for each of M*e_1, ..., M*e_10, because this will minimize E=10 * m^2

The effects are:

  • on PQ-encoded vectors, the relative reconstruction erroras measured with ||x - r(c(x))||^2 / ||x||^2, is bad

  • on in inverted indexes (IndexIVF classes, "IMI" or "IVF" factory strings) all vectors with small norm are assigned to the same inverted list. This makes the data structure very unbalanced, which slows down searches.

To remedy this, please look at the application context. For example, it may be useful to normalize the vectors before compression and store the norms separately. This has been used in the work https://arxiv.org/pdf/1612.03651.pdf

Closing issue, as the question seems to be answered.

@mdouze hi. If i use the hnsw index. is it will be better to use the l2 norm to the vector ?

Was this page helpful?
0 / 5 - 0 ratings

Related issues

linghuang picture linghuang  路  3Comments

minjiaz picture minjiaz  路  3Comments

xxllp picture xxllp  路  3Comments

siva2k16 picture siva2k16  路  3Comments

brunodoamaral picture brunodoamaral  路  3Comments