Numba: stencil kernel performance not scaling in njit(parallel=True)

Created on 6 Nov 2020 · 3 Comments · Source: numba/numba

import numba as nb
import numpy as np

@nb.stencil
def kernel1(a):
    return 0.25 * (a[0, 1] + a[1, 0] + a[0, -1] + a[-1, 0])

@nb.njit
def calc1(in_arr):
    out_arr = kernel1(in_arr)
    return out_arr

@nb.njit(parallel=True)
def calc2(in_arr):
    out_arr = kernel1(in_arr)
    return out_arr

@nb.njit
def calc3(in_arr):
    out_arr = np.zeros(in_arr.shape)
    for i in range(1, in_arr.shape[0]-1):
        for j in range(1, in_arr.shape[1]-1):
            out_arr[i,j] = 0.25 * (in_arr[i, j+1] + in_arr[i+1, j] + in_arr[i, j-1] + in_arr[i-1, j])
    return out_arr

@nb.njit(parallel=True)
def calc4(in_arr):
    out_arr = np.zeros(in_arr.shape)
    for i in range(1, in_arr.shape[0]-1):
        for j in range(1, in_arr.shape[1]-1):
            out_arr[i,j] = 0.25 * (in_arr[i, j+1] + in_arr[i+1, j] + in_arr[i, j-1] + in_arr[i-1, j])
    return out_arr


# 50000 x 10000 float64 array (about 4 GB of memory)
input_arr = np.linspace(1, 50000, 500000000).reshape((50000, 10000))

For a large matrix (input_arr), I see the following run times (average of 5 runs):
calc1(input_arr) [njit + stencil] - 2.09s
calc2(input_arr) [Parallel njit + stencil] - 2.35s
calc3(input_arr) [njit + Loops] - 2.10s
calc4(input_arr) [Parallel njit + Loops] - 1.08s
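
For anyone reproducing these numbers, here is a minimal timing sketch. It assumes the averages should exclude JIT compilation, so it makes one untimed warm-up call first. The helper name time_avg is hypothetical and is not part of Numba or of the original report.

```python
import time

import numpy as np


def time_avg(fn, arr, runs=5):
    """Return the mean wall-clock time (seconds) of fn(arr) over `runs` calls."""
    # Untimed warm-up call: for a Numba-jitted function this triggers
    # compilation, so compile time does not leak into the measurements.
    fn(arr)
    elapsed = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(arr)
        elapsed.append(time.perf_counter() - start)
    return sum(elapsed) / runs


if __name__ == "__main__":
    # Small stand-in array; substitute calc1..calc4 and input_arr to reproduce.
    demo = np.ones((1000, 1000))
    print(f"np.sum: {time_avg(np.sum, demo):.6f}s")
```

Calling time_avg(calc2, input_arr) and time_avg(calc4, input_arr) would then give directly comparable figures.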

Based on these timings, njit + stencil performs about the same as njit + Loops, since stencil kernels are not executed in parallel by default.
What I am trying to understand is why Parallel njit + stencil does not perform similarly to Parallel njit + Loops.
How can I parallelize the stencil execution?
I have 16 threads available on my machine.

Any solution will be highly appreciated. Thank you!

This issue was first mentioned on GitHub in https://github.com/numba/numba/issues/2982

[x] I have tried using the latest released version of Numba (most recent is
visible in the change log: https://github.com/numba/numba/blob/master/CHANGE_LOG).
[x] I have included below a minimal working reproducer (if you are unsure how
to write one see http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports).

ParallelAccelerator performance - run time

animator

All 3 comments

I can confirm the performance seen here. I saw multiple threads running and the process taking 200% cpu time in parallel=True versions.

Ping @DrTodd13.

sklam on 6 Nov 2020

@animator I believe the problem is that when the stencil parfor calls a kernel, it allocates the result with np.full using the stencil's cval. I guess np.full with 0 is much slower than np.zeros.

As a workaround, pre-allocate the output with np.zeros and pass it to the stencil through the out argument, i.e. call kernel1(in_arr, out=out_arr) inside calc2. I think you'll then see performance equivalent to calc4 (my local testing shows this to be true).

I am working on a permanent fix that allocates the output with np.empty and fills in only the borders with cval. Hopefully that is much faster than np.full.