At present (b7dfe5cc), pybind11 proper only benchmarks compile time and artifact size for one given test setup (which tests arguments and simple inheritance, but that's about it, I think); the results can be seen here:
https://pybind11.readthedocs.io/en/stable/benchmark.html
https://github.com/pybind/pybind11/blob/v2.6.1/docs/benchmark.rst
However, it may be difficult to objectively and concretely judge the performance impact of a PR and weigh that against the value of the feature / issue resolution. Generally, benchmarking is done on an ad-hoc basis (totes works, but may make it difficult for less creative people like myself ;)
Primary motivating issues / PRs:
Secondary:
pybind11 (out of scope: other binding approaches)
dlopen - ish stuff, pybind11 internals upstart, binding registration, ...)
pybind11 finds the most important (how to weigh compile-time, size, speed, memory, etc.)
Given that performance benchmarks can be a P.I.T.A. (e.g. how to OS + interrupts, hardware capacity / abstractions, blah blah), ideally decisions should be made about relative performance on the same machine. Ideally, we should also publish some metrics for a given config to give people a "feel" for the performance, as was done for compile time.
github.com/pybind/pybind-benchmarks ?
pytest-benchmark
@wjakob @rwgk @rhaschke @YannickJadoul @bstaletic @henryiii @ax3l
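To make the "relative performance on the same machine" idea concrete, here is a minimal sketch using only the standard-library `timeit` module. The names `best_time`, `relative_cost`, `baseline_call` and `candidate_call` are hypothetical; the two callables are plain-Python stand-ins for "the same bound function built before and after a PR", not real pybind11 bindings.

```python
import timeit


def best_time(func, repeat=5, number=10_000):
    """Return the best per-call time in seconds over several repeats.

    Taking the minimum reduces (but does not remove) noise from the OS.
    """
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings) / number


def relative_cost(candidate, baseline, **kwargs):
    """Ratio of candidate to baseline per-call time on this machine."""
    return best_time(candidate, **kwargs) / best_time(baseline, **kwargs)


# Hypothetical stand-ins for two builds of the same bound function.
def baseline_call():
    return sum((1, 2, 3))


def candidate_call():
    return sum([1, 2, 3])


if __name__ == "__main__":
    ratio = relative_cost(candidate_call, baseline_call)
    print(f"candidate / baseline = {ratio:.2f}x")
```

Only the ratio is meant to be compared across runs; absolute numbers would still depend on the machine and its load.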
Can I ask what y'all think? Is this redundant w.r.t. what we already have?
This code was used for Google-internal micro-benchmarking:
https://drive.google.com/file/d/1EOGU_A28oBvzoLwdmo2RpImjyL6bv2rl/view?usp=sharing
The link is accessible only to select people. pybind11 maintainers have access already.
The original author is @kkimdev. He generously gave us permission to reuse what's useful for our purposes (under the OSS pybind11 org). I think we will need major adaptations for the code to work outside the Google environment.
Here is a doc with background and results of the micro-benchmarking:
https://docs.google.com/document/d/1UieJ9WZ9YVJLt_EsIYj4Ahw8OIb3fH6mOIAPQRzj46k/
The link is accessible only to the same group of people. The doc is meant to inform our work on benchmarks. Please do not make copies.
As before, the original author is @kkimdev.
A benchmarking tip: I think combining a C++ sampling profiler with a Python benchmark is extremely useful. My latest internal benchmark script (not the above one) supports a --run_pprof=True flag that runs the pprof sampling profiler and reports the result (I'm not sure whether the open source version is as good as the internal one, and there are probably other good open source C++ sampling profilers as well). The flame graphs in my doc are from that.
This is a bit annoying if external contributors would feel a calling to help out.
Future people without access, contributions and ideas are still welcome; sharing that code is up to the authors, but ideas can be discussed :-)
Yes, sorry, but getting clearance to make this info fully public is likely quite a bit of trouble. But we can add interested people, and we can summarize or report what we learn here.
No worries, @rwgk, I completely understand!
Just trying to make sure that interested users/contributors (and maybe future maintainers ;-) ) aren't scared off :-)
Also looking at the following to gauge how to possibly benchmark the CPython / PyPy low-level bits, for #2050-type stuff, in addition to Kibeom's suggestion for pprof:
Since cppyy was mentioned: I use pytest-benchmark as well (see: https://bitbucket.org/wlav/cppyy/src/master/bench/). It's hard to write good benchmarks, though, as features and defaults differ. For example, releasing the GIL by default is costly for micro-benches (and for large, costly, C++ functions, the bindings overhead doesn't matter). Another big expense is object tracking for identity matching on returns, which not all binders do (and is useless for micro-benches).
For real high performance, the processor matters as well. For example, PyPy has guards on object types in its traces, based on which a specific C++ overload selected by cppyy will be compiled in. On a processor with good branch prediction and a deep out-of-order execution queue, that overhead will not show up in wall clock time (assuming no hyper-threading, of course), but it will be measurable on a processor with simpler cores.
When sticking to CPython only, consider also that CFunction objects have seen a massive amount of support in the form of specialized tracks through the CPython interpreter since release 3. (This is what makes SWIG in "builtin" mode, not the default, absolutely smoke everything else.) Only since 3.8 have closures seen some love, with the API stabilized in 3.9. There's a 30% or so reduction in call overhead in there somehow (for cppyy), but it's proving to be quite a lot of work to implement.
That last point, CPython internal developments and the need to track/make use of them, is also why I'd be interested if the proposed benchmarks end up being made public. The only way of measuring the usefulness of such changes is by having a historic record to compare against and setting that up is quite some effort (esp. when switching development machines regularly).
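As a rough illustration of what such a historic record could look like, here is a minimal sketch that appends results to a JSON file keyed by machine, so that numbers from different development machines are never compared directly. All names (`machine_key`, `record_result`, `change_vs_previous`) and the file layout are assumptions for illustration, not part of any existing tool.

```python
import json
import platform
from pathlib import Path


def machine_key():
    """Identify the machine and interpreter the numbers were taken on."""
    return f"{platform.node()}-{platform.machine()}-py{platform.python_version()}"


def record_result(path, name, seconds, commit):
    """Append one timing to the history file and return this machine's runs."""
    path = Path(path)
    history = json.loads(path.read_text()) if path.exists() else {}
    runs = history.setdefault(machine_key(), {}).setdefault(name, [])
    runs.append({"commit": commit, "seconds": seconds})
    path.write_text(json.dumps(history, indent=2))
    return runs


def change_vs_previous(runs):
    """Relative change of the latest run against the one before it."""
    if len(runs) < 2:
        return None
    return runs[-1]["seconds"] / runs[-2]["seconds"] - 1.0
```

A negative return value from `change_vs_previous` means the latest run was faster than the previous one on the same machine.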