Create hardware_concurrency() number of threads and divide each signature set between the threads.
We probably don't want to divide trivial number of verifications, so some good lower-level cutoff where we'd only use one thread.
Added a pull request for this #1611, it's threading the nano::signature_checker::verify function based on std::thread::hardware_concurrency() number of threads.
Not sure what should be set as a trivial number of verifications, I just used the number of the many test case. Added a new test case as well with +1 verifications to trigger the multi-threaded version of the code.
Seeing around a 2x speedup on my 4 core laptop.
./core_test --gtest_filter=signature*
Running main() from gtest_main.cc
Note: Google Test filter = signature*
[==========] Running 4 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 4 tests from signature_checker
[ RUN ] signature_checker.empty
[ OK ] signature_checker.empty (0 ms)
[ RUN ] signature_checker.many
[ OK ] signature_checker.many (35 ms)
[ RUN ] signature_checker.many_threaded
[ OK ] signature_checker.many_threaded (17 ms)
[ RUN ] signature_checker.one
[ OK ] signature_checker.one (0 ms)
[----------] 4 tests from signature_checker (52 ms total)
[----------] Global test environment tear-down
[==========] 4 tests from 1 test case ran. (52 ms total)
[ PASSED ] 4 tests.
@termoose thanks for the effort! This is something we would like to get in soon, so let me know if you need any help with the PR remarks. In essence we want a thread pool, where each thread is operating on 256 batched chunks.
Hi thanks for the feedback! Will run some benchmarks with 256 sized chunks and update the the branch after testing it out. What鈥檚 the most common case? >256 verifications or less? I want to put the most common case first so we can save some cycles on returning early from this function.
Hard to say. When bootstrapping from 0 there will be a lot, but in real-time probably not that many (at least yet). I think this task is to improve bootstrapping so I would give that priority.
Closed the old PR, opened a new one #1614. It might be a bit faster than spawning threads each time, around 2.2x speedup on my 4 cores. We could experiment with the batch_size as well as the cut-off value for doing the single-threaded version. The current values seem fine from the tests I ran locally.
8 cores results with #1614 : 2x from 1000 (single) to 1001 (16 threads). Of course with the 256 batch size it's not using all of them.
Modified the cut-off to 100k just for testing and it yields a 7.5x speedup.
@wezrule isn't this why we see only up to 2048 blocks at once?
@guilhermelawless yes probably, but I'm not changing it :)
Edit: It was changed in the end to 2048 * (num_threads + 1)
Resolved with #1651
Most helpful comment
8 cores results with #1614 : 2x from 1000 (single) to 1001 (16 threads). Of course with the 256 batch size it's not using all of them.
Modified the cut-off to 100k just for testing and it yields a 7.5x speedup.