Hello,
I first raised this question on SO, but after digging into it further, I believe the issue has to do with the interaction between Python and C++ via pybind11.
As explained by @YannickJadoul in #1042, one way to expose a C++ vector on the Python side without copying is this:
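#include <pybind11/numpy.h>
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>  // std::vector <-> Python sequence conversion

#include <memory>
#include <type_traits>
#include <utility>
#include <vector>

namespace py = pybind11;

// Moves the sequence to the heap and hands ownership to a capsule, so the
// returned array can reference the data without copying it.
template <typename Sequence,
          typename = std::enable_if_t<std::is_rvalue_reference_v<Sequence&&>>>
inline py::array_t<typename Sequence::value_type> as_pyarray(Sequence&& seq) {
    auto size = seq.size();
    auto data = seq.data();
    auto seq_ptr = std::make_unique<Sequence>(std::move(seq));
    // The capsule deletes the sequence once the array is garbage collected.
    auto capsule = py::capsule(seq_ptr.get(), [](void* p) {
        std::unique_ptr<Sequence>(reinterpret_cast<Sequence*>(p));
    });
    seq_ptr.release();
    return py::array(size, data, capsule);
}

std::vector<float> cumsum(const std::vector<float>& nums) {
    std::vector<float> result(nums.size());
    float sum = 0;
    for (size_t i = 0; i < nums.size(); ++i) {
        sum += nums[i];
        result[i] = sum;
    }
    return result;
}

PYBIND11_MODULE(derived, m) {
    m.def("cumsum", [](const std::vector<float>& nums) {
        auto result = cumsum(nums);
        return as_pyarray(std::move(result));
    });
}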
As you can see in the linked SO question, this performs rather poorly compared to np.cumsum:
[nav] In [24]: x = np.arange(100_000, dtype=np.float32)
[nav] In [25]: %timeit np.cumsum(x)
295 µs ± 34.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
[ins] In [26]: %timeit derived.cumsum(x)
9.26 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
My first reaction was 'great, numpy must be doing some crazy vectorization' (though it was not immediately obvious to me how, since there is a data dependency between successive iterations of the loop).
I modified the C++ code to do this:
// sc is assumed to be an alias: namespace sc = std::chrono; (needs <chrono>)
std::pair<uint64_t, std::vector<float>> cumsum(const std::vector<float>& nums) {
    auto start = sc::high_resolution_clock::now();
    const auto size = nums.size();
    std::vector<float> result(size);
    float sum = 0;
    for (size_t i = 0; i < size; ++i) {
        sum += nums[i];
        result[i] = sum;
    }
    auto end = sc::high_resolution_clock::now();
    auto taken = sc::duration_cast<sc::microseconds>(end - start);
    // Return the time spent in the loop (in microseconds) along with the result.
    return std::make_pair(taken.count(), std::move(result));
}

// somewhere down
PYBIND11_MODULE(derived, m) {
    m.def("cumsum", [](const std::vector<float>& nums) {
        auto [taken, result] = cumsum(nums);
        return std::make_pair(taken, as_pyarray(std::move(result)));
    });
}
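And then a simple Python function like this:
import time

import numpy as np

import derived

def run(n):
    c = []   # time measured inside the C++ function, in microseconds
    py = []  # wall-clock time of the whole call as seen from Python, in seconds
    x = np.arange(100_000, dtype=np.float32)
    for _ in range(n):
        start = time.time()
        t, _ = derived.cumsum(x)
        end = time.time()
        c.append(t)
        py.append(end - start)
    return f'{np.mean(c)}us', f'{np.mean(py) * 1000:.3f}ms'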
Running this with run(100), I get the following numbers:
[ins] In [30]: run(100)
Out[30]: ('140.76us', '12.167ms')
Well, that's interesting. The time spent on the C++ side is minuscule compared to the cost of the invocation from the Python side.
Note that I verified that this is indeed a zero-copy solution by printing data() of my vector and x.__array_interface__ of the numpy array on the Python side, and they were indeed the same address.
Am I doing something stupid there? Clearly the overhead can't be that high, since numpy (using the C API of Python) is able to get the result quickly.
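A side note on the measurement itself: the wall-clock side of run could also be taken with time.perf_counter, which is intended for timing short intervals. The following is only a minimal sketch; mean_call_time and the stand-in workload are hypothetical names introduced here for illustration, and in the setup above the timed call would be derived.cumsum(x).

```python
import time
from statistics import mean


def mean_call_time(fn, *args, repeats=100):
    """Return the mean wall-clock time, in seconds, of one call to fn(*args)."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return mean(samples)


# Stand-in workload (assumption); the issue would time derived.cumsum(x) here.
elapsed = mean_call_time(sum, range(100_000), repeats=10)
print(f"{elapsed * 1e3:.3f} ms per call")
```

This does not change the conclusion above; it only makes the Python-side number less sensitive to the resolution of time.time.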
In the gitter channel, @YannickJadoul suggested that I change my cumsum implementation to take py::array_t<float> directly, since my original implementation still copies from np.array to std::vector<float>. Here is the new code:
std::pair<uint64_t, std::vector<float>> cumsum(py::array_t<float> nums) {
    auto unchecked = nums.unchecked();
    auto start = sc::high_resolution_clock::now();
    const auto size = nums.size();
    std::vector<float> result(size);
    float sum = 0;
    for (size_t i = 0; i < size; ++i) {
        sum += unchecked[i];
        result[i] = sum;
    }
    auto end = sc::high_resolution_clock::now();
    auto taken = sc::duration_cast<sc::microseconds>(end - start);
    return std::make_pair(taken.count(), std::move(result));
}
Running this with the run function above gives the following numbers:
('149.79us', '153.148us')
Yannick further explained that when casting a np.array to std::vector, pybind treats it as a sequence of objects, casts each object to float, and verifies that the conversion succeeds. A much better and faster way is to use py::array_t<float> directly to avoid the overhead above.
Closing the issue, and also updating SO issue.
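For anyone landing here later, the difference between per-element conversion and direct buffer access can be illustrated with a rough pure-Python analogy. This is not pybind11's implementation: array.array stands in for the float32 NumPy array so that the sketch needs only the standard library, and convert_elementwise and view_buffer are hypothetical helper names.

```python
import array
import timeit

# Stand-in for the float32 NumPy array from the issue (assumption: stdlib
# array.array is used only to keep the sketch free of third-party packages).
x = array.array("f", range(100_000))


def convert_elementwise(seq):
    """Mimic a generic sequence conversion: touch every item as a Python object."""
    return [float(item) for item in seq]


def view_buffer(seq):
    """Borrow the underlying memory via the buffer protocol; no per-item work."""
    return memoryview(seq)


copied = convert_elementwise(x)
view = view_buffer(x)

# Only the view shares memory with the original array.
x[0] = 42.0
print(view[0], copied[0])

print("element-wise:", timeit.timeit(lambda: convert_elementwise(x), number=20))
print("buffer view: ", timeit.timeit(lambda: view_buffer(x), number=20))
```

The element-wise route does work proportional to the number of items on every call, which is the kind of overhead the std::vector<float> signature was paying on the way in.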