Hello,
Are the input vectors expected to be approximately of the same norm? I didn't see such requirements in the README but in practice I have had trouble if the norms of the vector vary too much.
Thanks in advance,
Hi,
This is a good question.
I assume you mean you have varying norms within a set of vectors that are encoded with a ProductQuantizer in IndexPQ or in an IndexIVF.
The objective of the quantizer is to minimize the squared reconstruction error, ie.
E = min sum_{x\in X} || x - r(c(x)) ||^2
where x is the set of vectors to encode, c() and r() are the encoding and decoding functions respectively, assuming X is both the training set and the set to be encoded.
This optimization objective shows that when there are vectors with very small norms, the encoder will "find it easy" to just encode them as a 0 vector.
For example, assume X = {m*e_1, ..., m* e_10, M * e_1, ..., M*e_10 }, with scalars m << M and e_1, e_10 ten unit vectors. Assume the codec has 11 codes available. Then it will just assign codeword 0 for m*e_1, ...., m*e_10 and allocate one codeword for each of M*e_1, ..., M*e_10, because this will minimize E=10 * m^2
The effects are:
on PQ-encoded vectors, the relative reconstruction erroras measured with ||x - r(c(x))||^2 / ||x||^2, is bad
on in inverted indexes (IndexIVF classes, "IMI" or "IVF" factory strings) all vectors with small norm are assigned to the same inverted list. This makes the data structure very unbalanced, which slows down searches.
To remedy this, please look at the application context. For example, it may be useful to normalize the vectors before compression and store the norms separately. This has been used in the work https://arxiv.org/pdf/1612.03651.pdf
Closing issue, as the question seems to be answered.
@mdouze hi. If i use the hnsw index. is it will be better to use the l2 norm to the vector ?
Most helpful comment
Hi,
This is a good question.
I assume you mean you have varying norms within a set of vectors that are encoded with a
ProductQuantizerinIndexPQor in anIndexIVF.The objective of the quantizer is to minimize the squared reconstruction error, ie.
E = min sum_{x\in X} || x - r(c(x)) ||^2
where x is the set of vectors to encode, c() and r() are the encoding and decoding functions respectively, assuming X is both the training set and the set to be encoded.
This optimization objective shows that when there are vectors with very small norms, the encoder will "find it easy" to just encode them as a 0 vector.
For example, assume
X = {m*e_1, ..., m* e_10, M * e_1, ..., M*e_10 }, with scalarsm << Mande_1, e_10ten unit vectors. Assume the codec has 11 codes available. Then it will just assign codeword 0 form*e_1, ...., m*e_10and allocate one codeword for each ofM*e_1, ..., M*e_10, because this will minimizeE=10 * m^2The effects are:
on PQ-encoded vectors, the relative reconstruction erroras measured with ||x - r(c(x))||^2 / ||x||^2, is bad
on in inverted indexes (
IndexIVFclasses, "IMI" or "IVF" factory strings) all vectors with small norm are assigned to the same inverted list. This makes the data structure very unbalanced, which slows down searches.To remedy this, please look at the application context. For example, it may be useful to normalize the vectors before compression and store the norms separately. This has been used in the work https://arxiv.org/pdf/1612.03651.pdf