Let's say I've performed matches on features that have been detected using your pipeline. Can I export the runtimes for those matches, or do I need to force a recompute of the matching times? This is for benchmarking.
Yet the matching times are not logged.
It could be worth adding them to the command-line report (cout).
We could also think about generating a matching_report.html file to log the matching times and some additional info.
You can use the openMVG::system::Timer class for this purpose.
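A minimal sketch of how that could look around an existing matching call (assuming Timer starts counting on construction and elapsed() returns seconds; matcher.Match(...) is a placeholder, not an actual openMVG API):

```cpp
#include <iostream>
#include "openMVG/system/timer.hpp"

// Hypothetical helper: time one pairwise matching call and report it on cout.
template <typename MatcherT, typename RegionsT, typename MatchesT>
void TimedMatch(MatcherT & matcher,
                const RegionsT & regions_I,
                const RegionsT & regions_J,
                MatchesT & matches)
{
  openMVG::system::Timer timer;                 // starts timing on construction
  matcher.Match(regions_I, regions_J, matches); // placeholder matching call
  std::cout << "Matching took " << timer.elapsed() << " s\n";
}
```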
Was the provided answer sufficient, or do we need to elaborate more?
@pmoulon Yep! That's sufficient. I can also start integrating (just finished up my GPU Brute Force L2 matching code, so I'm going to start on this and my microkernel CUDA architecture next!)
Eager to see results from the GPU matcher. Perhaps reorganizing the pairs per thread would allow more gains on all platforms (CPU/GPU).
Here threads are launched over all the pairs... there is no re-ordering.
@mdaiter Perhaps this approach would be valuable for binary descriptor matching too http://arxiv.org/abs/1307.2982
^ @komrad36 @csp256 I think that's your cue ;)
Since there is some open-source code available, it should be easy to integrate. @mdaiter @komrad36 @csp256
https://github.com/norouzi/mih
I think @csp256 and @komrad36 have been discussing rolling their own GPU-based MIH matcher. Interested to see the results!
Also @pmoulon I had integrated that repository into an older build of my version of OpenMVG. I think a GPU-based version would probably be far more efficient, as well as a cleaner solution. Their code isn't too clean, and the interface forces a lot of assumptions onto the user (a NASA-developed filesystem must be installed, logging is mandatory, etc.).
@pmoulon HDF5-lib is what came to mind. Here's the issue some users ran into when using it: https://github.com/mdaiter/openMVG/issues/1 . I don't know if we want to use their implementation due to the heavy pre-requisites.
@pmoulon can I convert SIFT descriptors to floats from your 128-element unsigned char storage? I've tried multiplying by 1/512 and 1/256, but it isn't working. My matcher only works with floats at the moment, and I was hoping to test SIFT keypoints with it. I'm currently getting excellent results with my deep descriptor libraries.
Perhaps it's a C++ type-conversion problem.
You can see that I multiply by 512 there, so multiplying back by 1.0/512.0 should work.
Print the values to check the results.
You will still have a bit of quantization effect, but it should work fine.
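For reference, a minimal CPU-side sketch of that conversion (the 1.0f/512.0f factor is the inverse of the quantization mentioned above; the flat buffer layout is an assumption):

```cpp
#include <cstdint>
#include <vector>

// Convert SIFT descriptors stored as unsigned char back to float,
// undoing the *512 quantization used for storage.
std::vector<float> ToFloatDescriptors(const std::vector<std::uint8_t> & descs_uchar)
{
  std::vector<float> descs_float(descs_uchar.size());
  const float scale = 1.0f / 512.0f;
  for (std::size_t i = 0; i < descs_uchar.size(); ++i)
    descs_float[i] = static_cast<float>(descs_uchar[i]) * scale;
  return descs_float;
}
```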
Or, I think, you can use the GPU memory functions to convert uchar data to float directly.
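For example, OpenCV's CUDA module can do the type conversion and the rescaling in a single call (a sketch, assuming the descriptors are already uploaded as a CV_8U GpuMat):

```cpp
#include <opencv2/core/cuda.hpp>

// Convert 8-bit descriptor data to 32-bit float on the GPU,
// applying the 1/512 rescale in the same call.
void ConvertDescriptorsOnGpu(const cv::cuda::GpuMat & descs_u8,
                             cv::cuda::GpuMat & descs_f32)
{
  descs_u8.convertTo(descs_f32, CV_32F, 1.0 / 512.0);
}
```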
@pmoulon just got it working. Seems like there was an issue with the cv::cuda::GpuMat::convertTo function when converting between CV_8U (unsigned char) and CV_32F (32-bit floating point). All seems to work now!
Also, interesting papers. Let me check those out.
Going to search for a GPU-based way of converting data between types (or just roll my own kernel). Afterwards, I should be good to go to push that out and my Torch-based loading mechanism!
Happy to know that it's working now.
Eager to see some matching-time comparisons between the different methods ;-)
@pmoulon I don't know anything about OpenMVG's internals, but Kareem & I do have timing information for our improved GPU brute-force binary descriptor matcher, which works on 512-bit (LATCH) descriptors. I trust Matthew will be able to integrate it, as the CPU interface to the GPU code is identical to what he has been working with.
On my GTX 970M it can do ~17 billion descriptor comparisons per second. This is over 8 times faster than OpenCV's brute force binary descriptor matcher, which works on descriptors of half the length (256 bits; what ORB uses).
On a GTX 1080 our matcher sees a further ~2.6 times speedup relative to my GTX 970M. That means that with a 1080 you can exactly match image pairs, each with 27,000 LATCH descriptors, at 60 image pairs per second. (19,000 keypoints if you check for symmetric matches)
The GPU MIH matcher has been designed, but there are some kinks. Due to the fickle nature of GPUs we might not even see a speedup! We will implement it and report back, but it is slightly deprioritized.
For now we are trying to get GPU LATCH faster than GPU ORB descriptor extraction. Currently we are 20% slower (600 nanoseconds / descriptor versus 500 nanoseconds / descriptor on 970M), but have identified several ways to improve the speed.
Thank you @csp256 for the report (blazing fast).
Having a generic binary/scalar GPU matcher would be a great addition to the lib.
Do you think the code could be changed to work under OpenCL?
@pmoulon I think I can get my Brute Force L2 matcher working under OpenCL (it'll be a good exercise ;-) ). The one aspect that would worry me is the dynamic parallelism I've begun implementing for the matcher. My current matcher is taken from the OpenCV library, so that shouldn't be hard to port.
Some sample time comparisons for deep learned features:
256-bit floating point features:
CognacStJaquesDoor: 7033.8 seconds (CPU matcher) -> 188.763 seconds (GPU matcher) (~614k feature descriptions)
PoitiersMainCathedralDoor: 18444.7 seconds (CPU matcher) -> 514.699 seconds (GPU matcher) (~1,080k feature descriptions)
Thank you @mdaiter for the report.
We can test later on classic SIFT descriptors.
Which CPU method did you consider for your test (BruteForce, ANNL2, CascadeHashing)?
Happy to see you have used the @rperrot datasets!
@pmoulon I used the BruteForce method on the CPU and GPU.
I've already tested on the classic SIFT descriptor as well. It works like a charm.
I have all of the matching results for the @rperrot datasets, if you'd like!
Also, could you please give us an idea of your hardware (GPU, CPU)?
@ORNis GTX 1080 + Core i7-3770 CPU @ 3.40GHz + 32 GB RAM
Thank you.
@pmoulon I really tried to abstract the matcher away from any specific constraints as much as possible; the GPU Brute Force L2 matching class is therefore a template class, parameterized by descriptor length and descriptor type. From there, you can easily create new specializations for new descriptors. Just let me know which ones you want!
We will need SIFT(128 uchar & float) & SURF(64 float) like regions, plus the ones you need for your experiments.
I have also a tiny descriptor (20 float) that can be used for feature tracking in video sequences (https://github.com/openMVG/openMVG/blob/develop/src/software/VO/Tracker.hpp#L41)
So the following should be OK:
// SIFT like Regions
GPUBruteForceL2Matcher<unsigned char, 128>;
GPUBruteForceL2Matcher<float, 128>;
// SURF like Regions
GPUBruteForceL2Matcher<float, 64>;
// Dissociated dipole like Regions
GPUBruteForceL2Matcher<float, 20>;
// DeepDescriptor like Regions
GPUBruteForceL2Matcher<float, 256>;
GPUBruteForceL2Matcher<float, 512>;
We must handle the dispatch code at the right moment, and also handle the case where the requested configuration is not yet supported (warning message).
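A possible shape for that dispatch, as a sketch only (the specializations are the ones listed above; the function name, fallback behaviour, and warning text are hypothetical):

```cpp
#include <cstddef>
#include <iostream>
#include <type_traits>

// Pick a GPU matcher specialization from (scalar type, descriptor length);
// return false and warn when no GPU specialization exists, so the caller
// can fall back to the CPU matcher.
template <typename ScalarT>
bool DispatchGpuMatcher(std::size_t descriptor_length)
{
  if (std::is_same<ScalarT, float>::value)
  {
    switch (descriptor_length)
    {
      case 20: case 64: case 128: case 256: case 512:
        // instantiate GPUBruteForceL2Matcher<float, N> here
        return true;
    }
  }
  else if (std::is_same<ScalarT, unsigned char>::value && descriptor_length == 128)
  {
    // instantiate GPUBruteForceL2Matcher<unsigned char, 128> here
    return true;
  }
  std::cerr << "Warning: no GPU matcher for this descriptor configuration; "
               "falling back to the CPU matcher.\n";
  return false;
}
```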
Regarding performance (on a collection of datasets), we will do that when we are close to the final PR state.
We can imagine a tool shipping with the PR, or alongside it, to compare the different matching methods in terms of repeatability (how close the resulting matches are to classic BruteForce) and performance (timing).
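As a rough idea, the repeatability part of such a tool could be as simple as comparing each method's matches against the BruteForce reference (a sketch; the pair-of-indices representation is an assumption, not the lib's actual match type):

```cpp
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// Fraction of reference (BruteForce) matches that another method reproduces.
double MatchRepeatability(const std::vector<std::pair<int, int>> & reference,
                          const std::vector<std::pair<int, int>> & candidate)
{
  if (reference.empty()) return 1.0;
  const std::set<std::pair<int, int>> ref_set(reference.begin(), reference.end());
  std::size_t shared = 0;
  for (const auto & match : candidate)
    if (ref_set.count(match)) ++shared;
  return static_cast<double>(shared) / reference.size();
}
```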
@pmoulon my GPU L2 Brute Force matching code is now open! Feel free to toy around with it!
Thanks.
I will try to play with it.
Do you plan to make an OpenCV-free version (I mean, rewrite a GPU matcher kernel that does not depend on OpenCV)?
@pmoulon Yep! I'm working on it right now.
@pmoulon A GPU matcher can be implemented in OpenCL; however, my code cannot be ported. This is because my performance comes from using the low-level warp shuffle intrinsic operations in a clever way. From what I understand, these intrinsics are not exposed to OpenCL. I would expect more than an order of magnitude slowdown by using OpenCL.
I talked with @mdaiter about his GPU L2 matcher and we believe I can speed it up significantly. I will try making a brute force L2 CUDA matcher today. I understand that you need 512, 256, 128, and 20 float matchers? (sadly, the secret-sauce in my fast matcher is not particularly template-friendly; this may change though)
SIFT descriptors are converted to uint8's, right? It is entirely possible that leaving them as float32's will be faster on the GPU, even with the extra memory use.
This might sound strange, but the 20 float descriptor is almost counter productive for GPU work. Not being a power of two hurts it, and too small data underutilizes the GPU. I will have to handle it differently, but it shouldn't be a problem.
All of my reported times are assuming you have a significantly large problem. SfM at the scales @mdaiter is doing it is such a problem. However, if you are matching 500 descriptors frame to frame... well, that doesn't even saturate a single kernel block. It'll still be very fast, and the absolute run time very low, but the efficiency doesn't scale down. To prevent up to an order of magnitude efficiency decrease you would have to make the GPU be able to work on several such frame pairs at once.
These are mostly pedantic points, but because they are somewhat counterintuitive I thought I would raise them.
What do you mean by shuffling? Is this usable:
https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/shuffle.html ?
I know CUDA is cool, but it's hardware- and vendor-dependent and impossible to use on some platforms.
This might sound strange, but the 20 float descriptor is almost counter productive for GPU work. Not being a power of two hurts it, and too small data underutilizes the GPU. I will have to handle it differently, but it shouldn't be a problem.
Why not have a function that maps a descriptor of any length to a descriptor whose size is the next power of two?
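Something like the following would do it (a sketch only; zero-padding keeps L2 distances unchanged, so matching results are not affected):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Round up to the next power of two (e.g. 20 -> 32).
std::size_t NextPowerOfTwo(std::size_t n)
{
  std::size_t p = 1;
  while (p < n) p <<= 1;
  return p;
}

// Zero-pad a descriptor to the next power-of-two length; the extra zeros
// contribute nothing to L2 distances between descriptors.
std::vector<float> PadDescriptor(const std::vector<float> & desc)
{
  std::vector<float> padded(NextPowerOfTwo(desc.size()), 0.0f);
  std::copy(desc.begin(), desc.end(), padded.begin());
  return padded;
}
```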
@pmoulon is it possible to split this thread or to create a new thread?
@rperrot What OpenCL calls a shuffle is something entirely different. I quote from the article I linked above:
On GPUs, work-groups are executed in what are called warps or wave-fronts, and most modern GPUs can in fact exchange data between work-items in the same warp using specific shuffle intrinsics (which have nothing to do with the OpenCL C shuffle function): these intrinsics allow work-items to access the private registers of other work-items in the same warp. While warps in the same work-group still have to communicate using local memory, a simple reduction algorithm can thus be implemented using warp shuffle instructions and only requiring one word of local memory per warp, rather than one per work-item, which can lead to better hardware utilization (e.g. by allowing more work-groups per compute unit thanks to the reduced use of local memory).
The type of shuffle operation I am using is on slide seven. Specifically, I am using a very clever, optimal modification of that reduction pattern. To my knowledge OpenCL outright lacks the capability to utilize the hardware in this way at all, never mind as cleverly as I have.
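For readers unfamiliar with the pattern, here is a generic warp-level sum reduction built on shuffle intrinsics; it is a sketch of the basic technique only, not the matcher's kernel (__shfl_down_sync is the CUDA 9+ spelling; older toolkits used __shfl_down):

```cuda
// Sum a value across the 32 lanes of a warp using register-to-register
// exchanges; no shared memory is needed for the intra-warp reduction.
__device__ int warpReduceSum(int val)
{
  for (int offset = 16; offset > 0; offset >>= 1)
    val += __shfl_down_sync(0xffffffff, val, offset);
  return val; // lane 0 now holds the warp-wide sum
}
```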
If someone else is willing to add an OpenCL matcher, that is great (concerns about vendor lock-in have much merit), but it will be significantly slower. In fact, that person should contact me when that day comes, as I can show how clever local-memory access patterns can minimize reduction overhead.
Part of what I meant is that if your descriptor will gain discriminative power by outputting 32 floats instead of 20, then you should do that. The matcher will be equally fast for all descriptors of length 17 to 32 (even slightly faster for 32).
@csp256, @mdaiter , @rperrot
Here a new thread dedicated to GPU nearest neighbor matching
https://github.com/openMVG/openMVG/issues/603
Like @rperrot, I think it's time to summarize our notes in a new dedicated conversation.
Regarding genericity, I think that from a reproducible-research point of view, having code that works for all descriptor sizes is better than code optimized for one specific size (perhaps code so tuned that it is hard to understand). But we could debate this point forever (speed/genericity/readability)...
I mean it's OK to accept that the code is "slow" for very small descriptors if the code is generic, since that is understandable given the GPU architecture.
I know that 20 is a very small size; I listed all the sizes actually used by the lib so we can see the potential bottlenecks.
Regarding implementation, we can first introduce a CUDA-based matcher and later add an OpenCL one (even if it is not as fast), since, as @rperrot said, some people may be interested in OpenCL support (usage on various platforms).