I've noticed that multi-threading is handled in two distinct ways across the library. In some places we use Dask (restoration._cycle_spin, util.apply_parallel), while others use OpenMP (e.g. feature._cascade or restoration.rolling_ball). Both libraries have different specializations, but for our purposes they perform essentially the same job. Is there a policy on which one to prefer? Is there an aim to deprecate one in favor of the other?
I was also thinking that it might be useful to make multi-threading more central to the library, since many CV operations benefit from additional CPUs and not everybody is an expert on this. There are some arguments against multi-threading, though, so I am wondering where the core members stand on this.
I was thinking the same for vectorization and SIMD (where we could get some _real_ performance boost even for a single core); I noticed that this was partially being done already, but had a hard time assessing what's available and where there is room for improvement. For SIMD there is little to no downside; AVX2 has been around for a decade and I haven't seen a laptop or desktop CPU without it in a long time. Is there documentation I can read up on to see what's needed?
poking @lagru @sciunto @rfezzani @hmaarrfk
I see OpenMP as being for use within Cython code when a specific loop may need to be run in parallel, for example. For pure Python functions such as restoration.cycle_spin, OpenMP code is not an option, although there are other potential alternatives to Dask such as Python's concurrent.futures. A nice aspect of Dask is that it can be adapted to a number of different schedulers, so we can potentially support distributing blocks not only within a single node but also on a cluster (or, in the future, perhaps on GPUs via dask-cuda, etc.).
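To make the pure-Python option concrete, here is a minimal sketch of a cycle-spin-style average parallelized with only the stdlib's concurrent.futures (the `denoise` function is a hypothetical stand-in, not skimage's API; a real compiled denoiser that releases the GIL would let the threads actually overlap):

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def denoise(img):
    # Hypothetical stand-in for a real denoiser; with a C/Cython
    # implementation that releases the GIL, threads run concurrently.
    return img.astype(float)

def cycle_spin_average(img, shifts):
    """Average a denoiser over several circular shifts (cycle spinning)."""
    def one_spin(shift):
        shifted = np.roll(img, shift, axis=(0, 1))
        # Undo the shift after denoising so all results align.
        return np.roll(denoise(shifted), tuple(-s for s in shift), axis=(0, 1))
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(one_spin, shifts))
    return np.mean(results, axis=0)

img = np.arange(16.0).reshape(4, 4)
out = cycle_spin_average(img, [(0, 0), (1, 0), (0, 1)])
```

Swapping `ThreadPoolExecutor` for a Dask scheduler is what buys the cluster/GPU flexibility mentioned above.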
Another related library you may be interested in is dask-image which is currently focused on functions from scipy.ndimage rather than scikit-image itself, but the concept could readily be extended to many things here.
For SIMD, I don't know if any of our core devs have experience writing such code and I worry that incorporating such code would tend to make implementations less readable and harder for others to maintain. Keeping implementations approachable for new contributors and non-expert programmers is one priority of scikit-image. I think we can still benefit from upstream SIMD work going on in NumPy itself, though (see NEP-38 and recent NumPy PRs related to "universal intrinsics"/NPYV). I think one of the contributors of these features to NumPy had previously worked on similar functionality for OpenCV.
The main constraints when considering the inclusion of optimized code are, I think, readability, maintainability, and "reviewability".
Both OpenMP (via Cython) and Dask are good options for this purpose, and there is no strict rule or policy on which one to prefer =).
Another important consideration is the portability of the code (even though there is no official support for ARM and 32-bit); this makes the inclusion of AVX code more difficult. But skimage may benefit from numpy's efforts on this side, as @grlee77 pointed out :wink:.
I acknowledge maintainability, and this might be enough to completely veto SIMD; I wouldn't want to use this argument too lightly, though. If your CPU supports AVX-512 instructions, then a single 512-bit register can hold 64 uint8s (__m512i), 16 floats (__m512), or 8 doubles (__m512d). In the ideal case, speed scales linearly with register width, so you'd be looking at up to a ~64x (16x/8x) speedup over sequential code on a single core. Also, SIMD scales _multiplicatively_ with multi-threading, not additively: each core can now process 64 (16/8) values per instruction, which on 8 threads amounts to a potential 512x (128x/64x) speedup. For AVX2 (pretty common) you'd still be looking at half that (256x/64x/32x). Not sure if this is already factored into the decision that SIMD is not worthwhile/impossible to maintain.
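For concreteness, the register-width arithmetic above works out as follows (these are theoretical upper bounds only; real kernels are usually memory-bound and fall well short of them):

```python
# Lanes per SIMD register = register width / element width, and the
# resulting upper-bound speedups claimed above (single core and 8 threads).
def lanes(register_bits, element_bits):
    return register_bits // element_bits

threads = 8
for bits, name in [(512, "AVX-512"), (256, "AVX2")]:
    u8, f32, f64 = lanes(bits, 8), lanes(bits, 32), lanes(bits, 64)
    print(f"{name}: {u8}x/{f32}x/{f64}x per core, "
          f"{u8 * threads}x/{f32 * threads}x/{f64 * threads}x on {threads} threads")
```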
I don't think the "improve upstream numpy instead" argument holds water. Much of our critical code has moved to Cython, which essentially bypasses the entire numpy API (maybe some code that I haven't seen yet uses the C API?) and just treats the array as a contiguous block of memory. Improvements to upstream libraries (Cython excluded) won't do much for it.
On this note, our codebase is set up for auto-vectorization (not sure if this is Cython's default or done explicitly by us). While it is nice to get these accidental speedups, I think they are much harder to maintain than explicitly written SIMD: you are essentially programming against a black box (the compiler), and if some random patch changes the arrangement of your carefully crafted loops, your code suddenly becomes several times slower for no apparent reason (because the compiler stops auto-vectorizing).
On the multi-threading side, I know that both Dask and OpenMP are viable. However, our OpenMP code could be rewritten to use Dask (wrapping the Cython code inside a Dask-scheduled loop); this might introduce overhead in some cases, but the ability to scale to clusters or GPUs might make the trade-off worthwhile. Maybe it's just me, but one multi-threading dependency seems a lot better than two, especially if they do the same thing.
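A minimal sketch of the block-parallel pattern both options boil down to, using only the stdlib (Dask's map_blocks would manage the chunking and scheduling for you; `kernel` here is a hypothetical stand-in for a GIL-releasing compiled Cython function):

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def kernel(block):
    # Hypothetical stand-in for a compiled per-block kernel.
    return block * 2.0

def map_blocks_threaded(img, n_chunks=4):
    """Split along axis 0, run the kernel per chunk in threads, reassemble."""
    chunks = np.array_split(img, n_chunks, axis=0)
    with ThreadPoolExecutor() as pool:
        return np.concatenate(list(pool.map(kernel, chunks)), axis=0)

img = np.ones((8, 8))
out = map_blocks_threaded(img)
```

Note that filters whose output depends on neighboring pixels additionally need halo/overlap handling at the chunk boundaries, which Dask provides via map_overlap.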