Pybind11: Overhead of calling C++ function from python using pybind

Created on 7 Sep 2020 · 1 Comment · Source: pybind/pybind11

Hello,

I first raised this question on SO, but after digging into it further, I believe the issue has something to do with the interaction between Python and C++ via pybind.

As explained by @YannickJadoul in #1042, one way to expose a C++ vector to Python without copying is this:

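#include <pybind11/numpy.h>
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>
#include <memory>
#include <type_traits>
#include <utility>
#include <vector>
namespace py = pybind11;

// Accepts rvalues only, so the caller must hand over ownership of the sequence.
template <typename Sequence,
          typename = std::enable_if_t<std::is_rvalue_reference_v<Sequence&&>>>
inline py::array_t<typename Sequence::value_type> as_pyarray(Sequence&& seq) {
  auto size = seq.size();
  auto data = seq.data();  // stays valid: moving a std::vector keeps its buffer
  auto seq_ptr = std::make_unique<Sequence>(std::move(seq));  // move to the heap
  auto capsule = py::capsule(seq_ptr.get(), [](void* p) {
    // Runs when the array is garbage collected and frees the sequence.
    std::unique_ptr<Sequence>(reinterpret_cast<Sequence*>(p));
  });
  seq_ptr.release();  // the capsule owns the sequence from here on
  return py::array(size, data, capsule);
}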

std::vector<float> cumsum(const std::vector<float>& nums) {
  std::vector<float> result(nums.size());
  float sum = 0;
  for (size_t i = 0; i < nums.size(); ++i) {
    sum += nums[i];
    result[i] = sum;
  }
  return result;
}

PYBIND11_MODULE(derived, m) {
  m.def("cumsum", [](const std::vector<float>& nums) {
    auto result = cumsum(nums);
    return as_pyarray(std::move(result));
  });
}

As you can see in the linked SO question, this performs rather poorly compared to np.cumsum:

[nav] In [24]: x = np.arange(100_000, dtype=np.float32)

[nav] In [25]: %timeit np.cumsum(x)
295 µs ± 34.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

[ins] In [26]: %timeit derived.cumsum(x)
9.26 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

My first reaction was 'great, numpy must be doing some crazy vectorization' (though it was not immediately obvious to me how, since there is a data dependency between iterations of the loop).

I modified the C++ code to time the loop itself:

namespace sc = std::chrono;  // alias assumed by the code below; needs <chrono>

std::pair<uint64_t, std::vector<float>> cumsum(const std::vector<float>& nums) {
  auto start = sc::high_resolution_clock::now();
  const auto size = nums.size();
  std::vector<float> result(size);
  float sum = 0;
  for (size_t i = 0; i < size; ++i) {
    sum += nums[i];
    result[i] = sum;
  }
  auto end = sc::high_resolution_clock::now();
  auto taken = sc::duration_cast<sc::microseconds>(end - start);

  // Return the time spent in the loop (in microseconds) along with the result.
  return std::make_pair(taken.count(), std::move(result));
}

// somewhere down, inside PYBIND11_MODULE
m.def("cumsum", [](const std::vector<float>& nums) {
  auto [taken, result] = cumsum(nums);
  return std::make_pair(taken, as_pyarray(std::move(result)));
});

And then a simple python function like this:

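import time

import numpy as np

import derived  # the pybind11 module built from the code above

def run(n):
    c = []   # microseconds measured inside the C++ function
    py = []  # seconds measured around the call from Python
    x = np.arange(100_000, dtype=np.float32)
    for _ in range(n):
        start = time.time()
        t, _ = derived.cumsum(x)
        end = time.time()
        c.append(t)
        py.append(end - start)
    return f'{np.mean(c)}us', f'{np.mean(py) * 1000:.3f}ms'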

Running this with run(100), I get the following numbers:

[ins] In [30]: run(100)                                             
Out[30]: ('140.76us', '12.167ms')

Well, that's interesting. The time spent on the C++ side is minuscule compared to the total cost of the call as seen from Python.

Note that I verified that this is indeed a zero copy solution by printing data() of my vector and x.__array_interface__ of numpy array on python side, and they were indeed the same address.
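For anyone who wants to repeat that kind of check, here is a minimal sketch of the address comparison. It uses plain NumPy arrays as stand-ins for the array returned by the extension, since the compiled derived module is not assumed to be available, and data_address is a hypothetical helper name, not part of NumPy or pybind11.

```python
import numpy as np


def data_address(arr):
    """Return the address of the first byte of the array's buffer."""
    # __array_interface__["data"] is a (pointer, read_only) tuple.
    return arr.__array_interface__["data"][0]


x = np.arange(8, dtype=np.float32)
view = x[:]      # a view: same underlying buffer, no copy
copy = x.copy()  # a real copy: freshly allocated buffer

print("view shares buffer:", data_address(view) == data_address(x))
print("copy shares buffer:", data_address(copy) == data_address(x))
```

On the C++ side, the value to compare against would be the pointer returned by data() on the vector before it is handed to as_pyarray.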

Am I doing something stupid here? Clearly the overhead can't be that high, since numpy (using the Python C API) is able to get the result quickly.


skgbanga

Most helpful comment

In the Gitter channel, @YannickJadoul suggested that I change my cumsum implementation to take py::array_t<float> directly, since my original implementation still copies from the np.array into a std::vector<float>. Here is the new code:

std::pair<uint64_t, std::vector<float>> cumsum(py::array_t<float> nums) {
  auto unchecked = nums.unchecked();
  auto start = sc::high_resolution_clock::now();
  const auto size = nums.size();
  std::vector<float> result(size);
  float sum = 0;
  for (size_t i = 0; i < size; ++i) {
    sum += unchecked[i];
    result[i] = sum;
  }
  auto end = sc::high_resolution_clock::now();
  auto taken = sc::duration_cast<sc::microseconds>(end - start);

  return std::make_pair(taken.count(), std::move(result));
}

Running the run function above against this version gives the following numbers:

('149.79us', '153.148us')

Yannick further explained that when casting an np.array to std::vector, pybind treats it as a sequence of objects, casts each object to float, and verifies that each conversion succeeds. A much better and faster way is to use py::array_t<float> directly, which avoids that overhead.
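To get a feel for the difference, here is a rough Python-only analogy, not pybind11's actual implementation. load_as_sequence and load_as_array are hypothetical names: the first mimics walking the input as a generic sequence and converting every item, the second mimics reusing the existing float32 buffer when the dtype already matches.

```python
import time

import numpy as np


def load_as_sequence(obj):
    """Analogy for the std::vector<float> path: convert items one by one."""
    return [float(item) for item in obj]


def load_as_array(obj):
    """Analogy for the py::array_t<float> path: reuse a matching buffer."""
    # np.asarray does not copy when the input is already a float32 ndarray.
    return np.asarray(obj, dtype=np.float32)


x = np.arange(100_000, dtype=np.float32)

start = time.perf_counter()
as_list = load_as_sequence(x)
middle = time.perf_counter()
as_array = load_as_array(x)
end = time.perf_counter()

print(f"per-element conversion: {(middle - start) * 1e3:.3f} ms")
print(f"buffer reuse:           {(end - middle) * 1e3:.3f} ms")
print("buffer shared with input:", np.shares_memory(as_array, x))
```

The absolute timings will differ from the C++ numbers above, but the per-element path scales with the number of items, while the buffer path does a constant amount of work per call.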

Closing the issue, and also updating SO issue.

skgbanga on 7 Sep 2020

