Hi,
I work on Transonic, a pure Python package (requiring Python >= 3.6) to easily accelerate modern Python-NumPy code with different accelerators (currently Cython, Pythran and Numba).
Reading the discussions you had about potentially using Pythran and Numba (for example in https://github.com/scikit-image/scikit-image/issues/2956 and https://github.com/scikit-image/scikit-image/pull/3226), it seems to me that Transonic could be useful for scikit-image in the long term. But that is not my point in this issue.
I'm working on some scikit-image kernels (see https://bitbucket.org/fluiddyn/transonic/src/default/doc/for_dev/scikit-image/). My goal is to improve Transonic so that good Pythran, Numba and Cython code can be produced from a single source using Transonic. Of course, I need good benchmarks for these kernels, and for these benchmarks I need realistic input parameters for the main functions in some of your Cython files. I think it would be much less work for you than for me to write the small pieces of code that produce these input parameters (I don't know the kernels and I would have to guess too much).
Here is a list of the functions I'm going to target:
Of course, I don't need these pieces of code for all functions right now, but if you could post code here for a few functions, it would help me a lot.
@paugier I think you meant scikit-image rather than scikit-learn, and have taken the liberty of editing your comment accordingly. =)
Everyone else: I exchanged a few emails with Pierre before encouraging him to post here. To summarise our discussions:
I won't have time to help Pierre out until early next year, so if anyone can jump in before that, we would very much appreciate it! =)
I think you meant scikit-image rather than scikit-learn
Oops, sorry 🤦♂️. At least I didn't post the question on the wrong repository!
I would really like to encourage this kind of exploration. I haven't looked
at Transonic in particular though.
I think some mechanism for us to test things without a deprecation policy
is important, especially seeing that problems often arise at the
installation step.
I think a subdirectory like "future" would be really useful in that regard.
Infrastructure-wise, if you want to see this out within a year, I think it
would need something like 2 release iterations before we understand all the
implications. At our current release cadence, you would expect to see the
second release at the 9-month anniversary of your contribution, and your
modules moved out of future at the 15th month.
If it takes more than 1 year in the testing phase, I think it would be an
insult to your time investment. At the current rate of releasing packages,
we would barely make that timeline.
I've advocated for more frequent releases in the past, but I think this kind
of experimentation wasn't considered. I would like to say that 0.14 was/is
holding us back, but I think it is a little more than that. Downloading
then re-uploading wheels is a manual step. So is checking for contribution
attributions, and finally building documentation.
We aren't doing badly in terms of release cadence, but we could be doing
a lot better, and I personally think it would encourage more contributions
like this one.
Installing a compiler is still pretty hard on Windows, and also requires
dependencies that aren't installable via conda on Mac.
Installing a compiler is still pretty hard on Windows, and also requires dependencies that aren't installable via conda on Mac.
Note that for packages like scikit-image, Transonic would have to be used in its ahead-of-time mode (using the boost decorator) and that no compilers would be needed for the users. The binaries would be shipped in the wheels / conda packages.
In that respect, it wouldn't change much compared to using Cython directly.
Of course, for the Numba backend, Numba and llvmlite are runtime dependencies (but that is less of a problem than installing Visual Studio!).
If it takes more than 1 year in the testing phase, I think it would be an insult to your time investment.
Let me say that I am not in a hurry. Transonic is anyway a long-term project and I don't see how it could become a standard in less than one year.
For example, Transonic depends on Python>=3.6, so it can't be used in a release supporting Python 3.5.
Moreover, issues will have to be fixed in Cython (mainly since Transonic uses the pure-Python mode of Cython, see https://transonic.readthedocs.io/en/latest/backends/cython.html). So we need time to fix them and then time for a Cython release.
An example of what I would need (here for the function cmorph._dilate):
setup = """
import numpy as np
from cmorph import _dilate
rows = 1024
cols = 1024
srows = 64
scols = 64
image = np.random.randint(0, 255, rows * cols, dtype=np.uint8).reshape(
(rows, cols)
)
selem = np.random.randint(0, 1, srows * scols, dtype=np.uint8).reshape(
(srows, scols)
)
out = np.zeros((rows, cols), dtype=np.uint8)
shift_x = np.int8(2)
shift_y = np.int8(2)
"""
stmt = "_dilate(image, selem, out, shift_x, shift_y)"
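For anyone who wants to contribute similar strings, here is a minimal sketch of how a setup/stmt pair like the one above can be timed with the standard `timeit` module. It is only an illustration under stated assumptions: `_dilate_stub` is a hypothetical stand-in so the sketch runs without the compiled `cmorph` module, and the array sizes are reduced. One detail worth knowing when writing such setups: `np.random.randint` excludes its upper bound, so `randint(0, 1, ...)` returns only zeros; the sketch draws from `(0, 2)` to get a structuring element with both zeros and ones.

```python
import timeit

import numpy as np


def _dilate_stub(image, selem, out, shift_x, shift_y):
    """Hypothetical stand-in for cmorph._dilate so the sketch runs anywhere."""
    out[...] = image
    return out


rows, cols = 128, 128
srows, scols = 8, 8

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(rows, cols), dtype=np.uint8)
# The upper bound is exclusive, so (0, 2) yields a mix of zeros and ones.
selem = rng.integers(0, 2, size=(srows, scols), dtype=np.uint8)
out = np.zeros((rows, cols), dtype=np.uint8)

# Names that the statement string is allowed to see.
namespace = {
    "_dilate": _dilate_stub,
    "image": image,
    "selem": selem,
    "out": out,
    "shift_x": np.int8(2),
    "shift_y": np.int8(2),
}
stmt = "_dilate(image, selem, out, shift_x, shift_y)"

# Repeat the measurement and keep the best run to reduce noise.
number = 100
times = timeit.repeat(stmt, globals=namespace, repeat=5, number=number)
best = min(times) / number
print(f"best of 5: {best:.2e} s per call")
```

Swapping `_dilate_stub` for the real kernel (and restoring realistic sizes) is all that should be needed to turn this into an actual benchmark.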
In this case, I'm quite happy with the preliminary result. From the code below, Transonic can produce Cython code as efficient as your manually written Cython code (and there is a very good speedup compared to Cython with the Pythran and Numba backends):
import numpy as np
from transonic import boost, Optional, Array

A = Array[np.uint8, "2d", "memview"]


@boost(wraparound=False, boundscheck=False)
def _dilate(
    image: A,
    selem: A,
    out: Optional[A] = None,
    shift_x: np.int8 = 0,
    shift_y: np.int8 = 0,
):
    """Return greyscale morphological dilation of an image.

    Morphological dilation sets a pixel at (i,j) to the maximum over all pixels
    in the neighborhood centered at (i,j). Dilation enlarges bright regions
    and shrinks dark regions.

    Parameters
    ----------
    image : ndarray
        Image array.
    selem : ndarray
        The neighborhood expressed as a 2-D array of 1's and 0's.
    out : ndarray
        The array to store the result of the morphology. If None is
        passed, a new array will be allocated.
    shift_x, shift_y : bool
        shift structuring element about center point. This only affects
        eccentric structuring elements (i.e. selem with even numbered sides).

    Returns
    -------
    dilated : uint8 array
        The result of the morphological dilation.
    """
    rows: np.intp = image.shape[0]
    cols: np.intp = image.shape[1]
    srows: np.intp = selem.shape[0]
    scols: np.intp = selem.shape[1]

    centre_r: np.intp = int(selem.shape[0] / 2) - shift_y
    centre_c: np.intp = int(selem.shape[1] / 2) - shift_x

    image = np.ascontiguousarray(image)
    if out is None:
        out = np.zeros((rows, cols), dtype=np.uint8)

    # Offsets of the non-zero structuring-element entries
    selem_num: int = np.sum(np.asarray(selem) != 0)
    sr: Array[np.intp, "1d", "memview"] = np.empty(selem_num, dtype=np.intp)
    sc: Array[np.intp, "1d", "memview"] = np.empty(selem_num, dtype=np.intp)

    s: int = 0
    r: np.intp
    c: np.intp
    for r in range(srows):
        for c in range(scols):
            if selem[r, c] != 0:
                sr[s] = r - centre_r
                sc[s] = c - centre_c
                s += 1

    # uint8 (not int8) so that maxima above 127 do not wrap around
    local_max: np.uint8
    value: np.uint8
    rr: np.intp
    cc: np.intp
    for r in range(rows):
        for c in range(cols):
            local_max = 0
            for s in range(selem_num):
                rr = r + sr[s]
                cc = c + sc[s]
                if 0 <= rr < rows and 0 <= cc < cols:
                    value = image[rr, cc]
                    if value > local_max:
                        local_max = value
            out[r, c] = local_max

    return np.asarray(out)
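When comparing backends, it can help to check outputs against a slow but obvious reference before trusting any timing. The following is a hedged sketch of such a check: `dilate_reference` is a hypothetical helper (not part of skimage or Transonic) that mirrors the loop logic of the kernel above in plain Python and NumPy, run on a tiny hand-checkable input.

```python
import numpy as np


def dilate_reference(image, selem, shift_x=0, shift_y=0):
    """Hypothetical reference dilation mirroring the kernel's loop logic."""
    rows, cols = image.shape
    centre_r = selem.shape[0] // 2 - shift_y
    centre_c = selem.shape[1] // 2 - shift_x
    out = np.zeros_like(image)
    # Offsets of the non-zero structuring-element entries from the centre
    offsets = [
        (int(r) - centre_r, int(c) - centre_c)
        for r, c in zip(*np.nonzero(selem))
    ]
    for r in range(rows):
        for c in range(cols):
            local_max = 0
            for dr, dc in offsets:
                rr, cc = r + dr, c + dc
                # Neighbours outside the image are ignored
                if 0 <= rr < rows and 0 <= cc < cols:
                    local_max = max(local_max, int(image[rr, cc]))
            out[r, c] = local_max
    return out


# A single bright pixel dilated by a 3x3 square becomes a 3x3 bright block.
image = np.zeros((5, 5), dtype=np.uint8)
image[2, 2] = 200
selem = np.ones((3, 3), dtype=np.uint8)
result = dilate_reference(image, selem)
print(result)
```

The bright value is deliberately above 127 so that a backend using a signed 8-bit accumulator would be caught by the comparison.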
Thanks for the concrete example @paugier (I took the liberty of enabling Python code highlighting for your previous comment)
I have not had time to take a look at transonic in any detail, but it looks very interesting.
To be clear, what you are asking for is whether we can provide example setup strings for other functions that were mentioned in your first comment?
@paugier that's brilliant! I looove having pure Python annotations result in compiled speedups! And I'm happy to hear that your time horizon matches ours. =)
When you say "a very good speedup", what exactly are we talking about? =)
@grlee77
what you are asking for is whether we can provide example setup strings for other functions that were mentioned in your first comment?
I think that's exactly right!
To be clear, what you are asking for is whether we can provide example setup strings for other functions that were mentioned in your first comment?
Yes! It's not always easy to guess which values should be chosen for benchmarking.
When you say "a very good speedup", what exactly are we talking about? =)
It's pretty impressive (the first line corresponds to the pyx in skimage and the 3 other lines to Transonic backends):
cython "skimage" 8.24e-04 s (= norm)
cython 8.33e-04 s (= 1.0115 * norm)
pythran 3.22e-05 s (= 0.0391 * norm)
numba 3.59e-05 s (= 0.0436 * norm)
It is for the function cmorph._dilate which is actually not even compiled in skimage!
The Pythran extension is compiled with -DUSE_XSIMD, and I'm not 100% sure about the portability of extensions built with this option. But I didn't use -march=native, so it could be portable (?).
The produced Cython code can still be improved even though the difference is not big in this case...
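For readers of the table above, the "= x * norm" column is each timing divided by the reference timing (the hand-written skimage Cython). A small sketch of that arithmetic, using the rounded timings posted above, so the last digits can differ slightly from the posted column:

```python
# Timings in seconds, copied from the table above.
timings = {
    'cython "skimage"': 8.24e-04,
    "cython": 8.33e-04,
    "pythran": 3.22e-05,
    "numba": 3.59e-05,
}

reference = 'cython "skimage"'
norm = timings[reference]

# Ratio below 1 means faster than the reference.
ratios = {name: t / norm for name, t in timings.items()}
for name, t in timings.items():
    if name == reference:
        print(f"{name:<18} {t:.2e} s (= norm)")
    else:
        print(f"{name:<18} {t:.2e} s (= {ratios[name]:.4f} * norm)")

# The same information expressed as a speedup factor.
speedup_pythran = norm / timings["pythran"]
print(f"pythran speedup: {speedup_pythran:.1f}x")
```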
I was curious about fused-type support for Cython, but it looks like you may have already implemented it?:
https://transonic.readthedocs.io/en/latest/generated/transonic.typing.html#transonic.typing.Array
@paugier "pretty impressive" :joy::joy::joy::joy::joy: such modesty! That's an incredible result!
I was curious about fused-type support for Cython, but it looks like you may have already implemented it?
Fused types are already implemented in the frontend (with a quite powerful API) and in the Pythran backend. However, only very simple things work for Cython (1) because I hit some Cython bugs (partly because I use their pure-Python mode) and (2) simply because the Cython backend needs improvement in this respect.
However, I'd like first to focus on cases without (or with very few and simple) fused types. There are several numerical kernels without fused types in skimage so I think it would already be a good first step to use Transonic for these kernels, and first only with Cython.
But if you can think of an interesting example with fused types in skimage, please tell me about it!
That's an incredible result!
It's good, yes. Even too good to be true... So I'd like to better understand what happens. And it's only for one function (with very simple Cython), which moreover is not really used in skimage!
That's why it's so important for this work to have good benchmarks, and therefore good setup strings as in https://github.com/scikit-image/scikit-image/issues/4199#issuecomment-537680703
Just for information, here is what needs to be improved in Transonic to produce very efficient Cython code:
Low-hanging fruit:
- cdivision=True and nonecheck=False
- np.ndarray[dtype=np.uint32_t, ndim=1, negative_indices=False, mode='c']
And more complicated things!
- with nogil:
- from libc.math cimport exp, pow
- <np.uint8_t> op_result, which in Cython is different from np.uint8(op_result) :-(
- cdef void foo(int[:] a) nogil:
I now understand a little better the performance differences between Cython and Pythran/Numba. The Cython extensions were compiled with gcc 7.2 and Pythran with clang 6.0. There is clearly a performance issue with gcc. With everything compiled with clang, I get:
module: cmorph
_dilate(image, selem, out, shift_x, shift_y)
cython "skimage" 5.21e-05 s (= norm)
cython 5.92e-05 s (= 1.1365 * norm)
pythran 3.32e-05 s (= 0.6370 * norm)
numba 3.44e-05 s (= 0.6607 * norm)
That's more reasonable! However, it would be interesting to know whether the extensions in some wheels (and even in the conda packages) are compiled with gcc. I'll have to run the same benchmark with the extension shipped in skimage...
I also have a result for another function, which this time is really used in skimage:
module: _greyreconstruct
reconstruction_loop(ranks, prev, next_, strides, current_idx, image_stride)
cython "skimage" 3.23e-05 s (= norm)
cython 2.75e-05 s (= 0.8501 * norm)
pythran 2.98e-05 s (= 0.9221 * norm)
numba 3.68e-05 s (= 1.1376 * norm)
In this case, the Cython written by Transonic is slightly faster than the Cython written by hand :slightly_smiling_face:
I don't know why yet. It could be because of a difference in types (?)
import numpy as np
from transonic import boost, Array

Au = Array[np.uint32, "1d", "C", "positive_indices"]
A = Array[np.int32, "1d", "C", "positive_indices"]


@boost(boundscheck=False)
def reconstruction_loop(
    ranks: Au,
    prev: A,
    next: A,
    strides: A,
    current_idx: np.intp,
    image_stride: np.intp,
):
    current_rank: np.uint32
    i: np.intp
    neighbor_idx: np.intp
    neighbor_rank: np.uint32
    mask_rank: np.uint32
    current_link: np.intp
    nprev: np.int32
    nnext: np.int32
    nstrides: np.intp = strides.shape[0]

    while current_idx != -1:
        if current_idx < image_stride:
            current_rank = ranks[current_idx]
            if current_rank == 0:
                break
            for i in range(nstrides):
                neighbor_idx = current_idx + strides[i]
                neighbor_rank = ranks[neighbor_idx]
                # Only propagate neighbors ranked below the current rank
                if neighbor_rank < current_rank:
                    mask_rank = ranks[neighbor_idx + image_stride]
                    # Only propagate neighbors ranked below the mask rank
                    if neighbor_rank < mask_rank:
                        # Raise the neighbor to the mask rank if
                        # the mask ranked below the current rank
                        if mask_rank < current_rank:
                            current_link = neighbor_idx + image_stride
                            ranks[neighbor_idx] = mask_rank
                        else:
                            current_link = current_idx
                            ranks[neighbor_idx] = current_rank
                        # unlink the neighbor
                        nprev = prev[neighbor_idx]
                        nnext = next[neighbor_idx]
                        next[nprev] = nnext
                        if nnext != -1:
                            prev[nnext] = nprev
                        # link to the neighbor after the current link
                        nnext = next[current_link]
                        next[neighbor_idx] = nnext
                        prev[neighbor_idx] = current_link
                        if nnext >= 0:
                            prev[nnext] = neighbor_idx
                        next[current_link] = neighbor_idx
        current_idx = next[current_idx]
A short update about this work (see also https://github.com/fluiddyn/transonic/tree/master/doc/for_dev/scikit-image).
It's very useful for Transonic. I released a new version (0.4.1) with some improvements of the Cython backend. For some kernels, Transonic 0.4.1 is able to produce efficient Cython code.
For other kernels considered in this experiment, more work is needed. The most useful missing features are (in order of importance):
- with nogil:
- from libc.math cimport exp, pow
- <np.uint8_t> result
- cdef void foo(int[:] a) nogil: (limited by Cython bugs)
I also worked a bit on fused types for Cython. The code in Transonic (and the produced Cython code) is much better. Some very simple cases work but then we are also limited by Cython bugs.
Of course, I still need more setup codes for the benchmarks (like these files).
Thanks @paugier!
It's very useful for Transonic.
This is so great to hear. We actually tried to get a joint grant with Numba to do this kind of work, explaining that it would lead to improvements in both skimage and Numba. We didn't get funded, but your statement vindicates that application. =) Perhaps we should consider applying for a joint skimage/transonic grant at the next CZI round? (deadline Dec 1, or the next one, I think Apr 1) Would you be interested in that?
Perhaps we should consider applying for a joint skimage/transonic grant at the next CZI round?
Yes, it would be great to do that. I'll send you an email to discuss the application. It could be good to involve Numba people and @serge-sans-paille.
A technical question: I struggle to get the same performance from the extension compiled locally from your .pyx file (skimage/morphology/_greyreconstruct.pyx in master) as from the extension contained in the scikit-image wheel (I guess scikit_image-0.15.0-cp37-cp37m-manylinux1_x86_64.whl).
For now I can only test with one file, so I don't know if I would get the same issue with the other extensions.
I use a very standard method to compile the .pyx files (https://github.com/fluiddyn/transonic/blob/master/doc/for_dev/scikit-image/setup_pyx.py) with gcc 7.4.0 and it is less efficient (~10%) than the extension in the wheel.
It is not a big deal but it would be good to start by exactly reproducing the performance of what is now in skimage!
Do you use anything special to compile your .pyx files to produce the wheels ?
Have a look at the wheel builder repo. It uses Matthew Brett's multibuild system to comply with the instruction set and libc requirements of the manylinux wheel format.
Dynamic dispatch can help compile one binary that is able to use specialized instructions when it can, but I've never implemented code myself that uses that; I just heard that OpenCV does this.
It could be good to involve Numba people and @serge-sans-paille.
Sure :-)
Plus I 100% second the approach of @paugier here. Having a single source and the ability to switch between backends looks like a very good property to me :-)
To be honest, I will try to get some funding for Transonic / Pythran and other Python accelerators.
For the applications, we really need some feedback from "the community" about the ideas associated with Transonic, so we wrote a long and serious text on these subjects: http://tiny.cc/transonic-vision.
It seems to me that this text is interesting for people using Python for science. However, I don't think I'm able to reach the potential readers.
Anyway, your points of view would be very interesting for us.
The gcc vs clang difference when compiling cython code is a bit worrying... It would be interesting to test this on other functions.
@paugier may I ask how you chose the list of functions mentioned above? I can work on providing examples for the other functions in the next few days. I think it's more interesting to concentrate on functions which take the longest time to execute (say for a 1000x1000 image).
@emmanuelle Sorry, I forgot to answer... I just took them from Serge's work and by looking in your repository. You're right that we should focus on what takes time in real cases.
Are Transonic and other more experimental things better suited to scikit-image-contrib?
@hmaarrfk I don't think that works in this case, where the purpose is to simplify and accelerate existing functions, rather than add new functionality. The purpose of skimage-contrib is to allow contributors to add more fringe/recent algorithms and methods that might not be maintainable for scikit-image, where we really just want the reference, very highly cited stuff.
I think it is important to have a place to prototype new dependencies, new compilers and build infrastructure. It seems hard to do that on "stable" software that is released every 6 months.
Please come mess around with github.com/jni/skan, among others! =P
And, sorry, I forgot, we do have such a thing, @emmanuelle started skimage-experimental, which is a mirror of skimage without its restrictions, or, in other words, a place for long-lived branches.
The problem with these endeavours, though, is that it's hard to get lots of real-world testing out of them...
For transonic, https://github.com/scikit-image/skimage-experiments seems appropriate.
One important thing is to have the same build procedure and associated packages on PyPI and conda-forge, so that at some point it becomes possible to ask some users to test things on their own setup with a few simple commands. It could be a way to get more real-world testing.
I propose to close this because it does not represent an issue,
and also seems to be no longer relevant.
@vsiegel we often like to keep track of ideas and experiments in issues like this one, in case we pick them up later. Transonic is definitely a candidate for further exploration!