Faiss: Get IDs from IndexIDMap2

Created on 20 May 2020 · 15 Comments · Source: facebookresearch/faiss

Summary

We search based on attributes and, following the Faiss Wiki, use one index per attribute. To do this, we need to know which IDs are stored in each attribute's index. We could build a separate data structure to track this, but that would add memory usage, and our IDMap2,Flat index already holds all the IDs.

Sadly, IndexIDMap2 has no function for getting the IDs. Would it be possible to add a get_ids() function that returns all the IDs?

That would be amazing!

Interface:

  • [ ] C++
  • [x] Python
duplicate question

All 15 comments

My C++ is quite rusty and I haven't dug much into this, but it seems like get_ids() would just need to return id_map in IndexIDMapTemplate. Based on https://github.com/facebookresearch/faiss/blob/master/Index.h#L53, it seems like returning this vector should yield a numpy-equivalent int64 array in Python.

If I tried to implement this C++ function, would I need to do anything else to make it work with SWIG and expose it as a Python function?

This is not necessary. To get the IDs, you can use this code

I see, thanks. I tried looking through examples/tutorials for something like that, but couldn't find it. I ended up coding up the get_ids() methods as seen in https://github.com/rune/faiss/commit/01fb5076dd2e0ff9e6f2ad43142de5d9d1908937.

I think a function like get_ids() would be more developer-friendly. To come up with the code in your gist on your own, you'd have to know that:
1) id_map contains the IDs in a vector data structure (can only be seen in the C++ source code)
2) the vector_to_array function exists and you need it to return an array, so that you get numpy data in Python

As a primarily Python developer, these steps are non-trivial. I tried various things to use id_map.index_map from Python, but would never have guessed that I had to convert the vector to an array w/o looking at the source code. Honestly, I could barely recall the difference between the two as I haven't used C++ for years 😅

Would you accept a PR adding get_ids()?

No. I don't want to add redundant functions that will have to be maintained. I will add the code snippet above to the FAQ.
Sorry for the incomplete documentation, but we don't have the resources to go to that level of detail.

Okay, I understand the need for keeping things simple. However, asking developers to rely on C++ internals seems like it could cause a lot of issues downstream.

For instance, I wouldn't be surprised if those internals changed without the FAQ getting updated at the same time. There are lots of ML libraries (e.g. TensorFlow) whose tutorials/examples ask developers to use internals that no longer work (or worse, are broken in hidden ways). It might actually be less work to add the function than to keep the FAQ/examples in sync :)

Additionally, we and others using the FAQ snippet would have to review every release thoroughly to check that the underlying C++ implementation didn't change, which makes upgrading much more painful.

If you write a PR with a test from the code snippet, I'd accept it. You can add it to

https://github.com/facebookresearch/faiss/blob/22b7876ef5540b85feee173aa3182a2f37dc98f6/tests/test_meta_index.py

Okay, that sounds good. I'll do that.

Another option could be to expose a get_ids() function through the Python interface as a wrapper around vector_to_array(self.id_map). That way, it doesn't clutter the C++ code and interface. At the same time, it's easy to find when using IndexIDMap in Python.

It is an explicit design decision to not clutter the Faiss code with getters and setters.

Makes sense. Last idea and then I promise I won't ask you more about this :)

Would it make sense to have SWIG map all C++ vectors to numpy arrays using vector_to_array?

All the exposed Faiss functions return numpy arrays so it's confusing that some class variables are returned as vectors, which you can't use directly and have to figure out how to use. Furthermore, vector_to_array is currently used 20 times in bench/test. It seems that using this mapping would simplify bench/test and simplify the life of future Python devs.

I can't get your gist to work on my MacOS laptop. Any idea why?

Code:

import numpy as np
import faiss

print(faiss.__version__)
index = faiss.IndexIDMap2(faiss.IndexFlatL2(32))
ids = np.random.randint(0, 5000, size=10)
x = np.random.rand(10, 32).astype("float32")
index.add_with_ids(x, ids)
print(faiss.vector_to_array(index.id_map))

On 1.6.1 (installed through conda):

Traceback (most recent call last):
  File "faiss_test.py", line 12, in <module>
    print(faiss.vector_to_array(index.id_map))
  File "/usr/local/anaconda3/envs/aura/lib/python3.6/site-packages/faiss/__init__.py", line 507, in vector_to_array
    assert classname.endswith('Vector')
AssertionError

On 1.6.3 (compiled locally):

Traceback (most recent call last):
  File "faiss_test.py", line 12, in <module>
    print(faiss.vector_to_array(index.id_map))
  File "src/faiss/__init__.py", line 565, in vector_to_array
    dtype = np.dtype(vector_name_map[classname[:-6]])
KeyError: 'LongLong'

Your gist works on 1.6.3 on Ubuntu 16.04. It's unclear to me what the difference would be. I have not identified any other Faiss functionality that doesn't work on my MacOS laptop. It might be related to https://github.com/facebookresearch/faiss/issues/1231.

Right, this is an issue on the Mac that we've seen before: #1020

I am facing this issue as well, on macOS Catalina 10.15.6.
The following code snippet works for me though, for anybody facing a similar problem:

ids = [index.id_map.at(int(i)) for i in range(index.ntotal)]

Thanks for sharing this. I'd imagine though that this solution is a lot slower as it does a Python -> C++ call for each element in the index?

Yes, it is much slower. I will try to reproduce it on a Mac.
