We search based on attributes and use one index per attribute following the Faiss Wiki. Related to this, we need to know which IDs are saved with each attribute. We could build a separate data structure to track this, but that adds memory usage, whereas our IDMap2,Flat index already holds all the IDs.
Sadly, IndexIDMap2 has no function for getting the IDs. Would it be possible to add a get_ids() function that returns all the IDs?
That would be amazing!
Interface:
My C++ is quite rusty and I haven't dug much into this, but it seems like get_ids() would just need to return id_map in IndexIDMapTemplate. Based on https://github.com/facebookresearch/faiss/blob/master/Index.h#L53, it seems like returning this vector should yield a numpy-equivalent int64 array in Python.
If I tried to implement this C++ function, would I need to do anything else to make it work with SWIG and expose it as a Python function?
This is not necessary. To get the IDs, you can use this code
I see, thanks. I tried looking through examples/tutorials for something like that, but couldn't find it. I ended up coding up the get_ids() methods as seen in https://github.com/rune/faiss/commit/01fb5076dd2e0ff9e6f2ad43142de5d9d1908937.
I think a function like get_ids() would be more developer-friendly. To come up with the code in your gist on your own, you'd have to know that:
1) id_map contains the id in a vector data structure (can only be seen in C++ source code)
2) vector_to_array function exists and that you need to return an array to get numpy data in Python
As a primarily Python developer, I find these steps non-trivial. I tried various things to use id_map.index_map from Python, but would never have guessed that I had to convert the vector to an array without looking at the source code. Honestly, I could barely recall the difference between the two as I haven't used C++ for years 😅
Would you accept a PR adding get_ids()?
No. I don't want to add redundant functions that will have to be maintained. I will add the code snippet above to the FAQ.
Sorry for the incomplete documentation, but we don't have the resources to go to that level of detail.
Okay, I understand the need for keeping things simple. However, asking developers to rely on C++ internals seems like it could cause a lot of issues downstream.
For instance, I wouldn't be surprised if those internals changed without the FAQ getting updated at the same time. There are lots of ML libraries (e.g. TensorFlow) whose tutorials/examples ask developers to use internals that no longer work (or worse, are broken in hidden ways). It might actually be less work to add the function than to keep the FAQ/examples in sync :)
Additionally, we and others using the FAQ snippet would have to review every release thoroughly to check that the underlying C++ implementation didn't change, which makes upgrading much more painful.
If you write a PR with a test from the code snippet, I'd accept it. You can add it to
Okay, that sounds good. I'll do that.
Another option could be to expose a get_ids() function through the Python interface as a wrapper around vector_to_array(self.id_map). That way, it doesn't clutter the C++ code and interface. At the same time, it's easy to find when using IndexIDMap in Python.
It is an explicit design decision to not clutter the Faiss code with getters and setters.
Makes sense. Last idea and then I promise I won't ask you more about this :)
Would it make sense to have SWIG map all C++ vectors to numpy arrays using vector_to_array?
All the exposed Faiss functions return numpy arrays so it's confusing that some class variables are returned as vectors, which you can't use directly and have to figure out how to use. Furthermore, vector_to_array is currently used 20 times in bench/test. It seems that using this mapping would simplify bench/test and simplify the life of future Python devs.
I can't get your gist to work on my MacOS laptop. Any idea why?
Code:
import numpy as np
import faiss
print(faiss.__version__)
index = faiss.IndexIDMap2(faiss.IndexFlatL2(32))
ids = np.random.randint(0, 5000, size=10)
x = np.random.rand(10, 32).astype("float32")
index.add_with_ids(x, ids)
print(faiss.vector_to_array(index.id_map))
On 1.6.1 (installed through conda):
Traceback (most recent call last):
File "faiss_test.py", line 12, in <module>
print(faiss.vector_to_array(index.id_map))
File "/usr/local/anaconda3/envs/aura/lib/python3.6/site-packages/faiss/__init__.py", line 507, in vector_to_array
assert classname.endswith('Vector')
AssertionError
On 1.6.3 (compiled locally):
Traceback (most recent call last):
File "faiss_test.py", line 12, in <module>
print(faiss.vector_to_array(index.id_map))
File "src/faiss/__init__.py", line 565, in vector_to_array
dtype = np.dtype(vector_name_map[classname[:-6]])
KeyError: 'LongLong'
Your gist works on 1.6.3 on Ubuntu 16.04. It's unclear to me what the difference would be. I have not identified any other Faiss functionality that doesn't work on my MacOS laptop. It might be related to https://github.com/facebookresearch/faiss/issues/1231.
Right, this is an issue on Mac that we've seen before: #1020
I am facing this issue as well, on macOS Catalina 10.15.6.
The following code snippet works for me though, for anybody facing a similar problem:
ids = [index.id_map.at(int(i)) for i in range(index.ntotal)]
Thanks for sharing this. I'd imagine though that this solution is a lot slower as it does a Python -> C++ call for each element in the index?
Yes, it is much slower. I will try to repro on a Mac.