pybind11 does not seem to correctly pass containers of char * to Python. I'm using the example of std::array<const char *, N>, but std::vectors have the same behavior. The char * received by C++ seems to be pointing to a random memory location. The memory is still within the process, as there are no segfaults (that I have seen, anyway).
This demo class has two separate constructors. One takes the type std::array<const char *, 2>, and the other takes the types const char *, const char*. The std::array version does the wrong thing, and the char *, when printed, just prints out garbage. The char *, char * version does the right thing, and prints the passed in string as expected.
I'm using pybind v2.5.0. I went through the docs on strings but didn't notice anything. Just guessing, but I would guess the problem might be shallow copies of pointers somewhere, since this only happens with containers. But I really have no clue.
This problem does not surface whenever std::string is used, which is another reason I think it may be related to shallow copies of pointers (since std::string will copy its data when copied).
#include <vector>
#include <stdio.h>
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>
namespace py = pybind11;
class Demo {
public:
Demo(std::array<const char *, 2> strs) {
printf("Got %s %s\n", strs[0], strs[1]);
}
Demo(const char * str1, const char * str2) {
printf("Got %s %s\n", str1, str2);
}
};
PYBIND11_MODULE(demo, m) {
// the real deal here
py::class_<Demo>(m, "Demo")
.def(py::init<std::array<const char *, 2>>())
.def(py::init<const char *, const char *>())
;
}
After building, the following Python code demonstrates the issue:
import demo
# prints random garbage
demo.Demo(['string 1', 'string 2'])
# prints the right thing
demo.Demo('string1', 'string2')
You probably need to make the container type opaque. The manual describes how to do so
Thanks for the pointer, @molpopgen. In my case I can modify the wrapped class (Demo in the example code) to just take std::string, but in general it may not be possible for people to do that.
I can think of three behaviors that would be preferable to the current behavior:
str to char *.As a reminder, the current behavior is that the bound code receives a pointer to seemingly random data. Silently corrupt data is never fun.
I gave you bad advice! Sorry. Pybind11 converts python strings to c++ strings for good reason, to avoid memory leaks. Passing in pointer as you are doing makes ownership unclear, and therefore up to you. You can write a custom init to handle these cases, taking care to avoid memory leaks.
Failing to compile wouldn't be correct because it makes assumptions about ownership. Auto conversion may not be correct because the char * may not be null-terminated. Etc.. It is up to you to avoid UB and leaks here.
Hmmm, maybe I'm not stating the problem clearly enough. pybind11 does the expected thing (with some caveats) when a function's parameter is a char *. pybind11 will encode it as utf-8, according to the docs here: https://pybind11.readthedocs.io/en/stable/advanced/cast/strings.html#passing-python-strings-to-c. To quote the docs:
When a Python str is passed from Python to a C++ function that accepts std::string or char * as arguments, pybind11 will encode the Python string to UTF-8. All Python str can be encoded in UTF-8, so this operation does not fail.
I think it's a bug that it doesn't do that when the parameter is a container of char *; I like the bare char * pybind11 behavior, and think pybind11 should do the same thing with char * containers.
I assume that for bare char * pybind11 copies the Python data to create a temporary bare char * and frees the memory after the memory after the function call completes?
Sorry for the delay in response—I'm in the middle of buying and selling houses/renovation.
I see what you are saying. I just see a vector of char * to be a red flag in this case. If you define a c++ function taking such a vector, and want pybind11 to convert python data for you, then there may be no destructors for the pointers? I've not looked at the internal code to check, though.
Why would pybind11 do something different for a bare char * as it does for a vector of them? Can't it just do the same thing, i.e. encode the Python strs, pass the encoded data, then free the data? pybind11 is already special casing char * to treat it as a string in some cases and it should be consistent.
Regardless, the current behavior is just plain wrong: a vector of char* ends up with the char * pointing to complete garbage.... have you run the demo code?
Why would pybind11 do something different for a bare
char *as it does for a vector of them?
You'd have to look at the code, but it is likely working in a loop and so the temporary is going out of scope. The current behavior is wrong, and probably either shouldn't compile or should warn and say that it isn't doing what you want.
For a class that cannot change from vector of pointers to strings, etc., this is a workaround, although we're not correctly addressing ownership here:
#include <vector>
#include <stdio.h>
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>
namespace py = pybind11;
class Demo
{
public:
Demo(std::array<const char *, 2> strs)
{
printf("Got %s %s\n", strs[0], strs[1]);
}
Demo(const char *str1, const char *str2)
{
printf("Got %s %s\n", str1, str2);
}
};
PYBIND11_MODULE(demo, m)
{
// the real deal here
py::class_<Demo>(m, "Demo")
.def(py::init([](std::array<std::string, 2> a) {
// NOTE: the behavior in here depends on what the class
// is actually doing, as we're in memory leak territoy
// in general. This works for this specific example, though.
return Demo(std::array<const char *, 2>{a[0].c_str(), a[1].c_str()});
}))
.def(py::init<const char *, const char *>());
}
Thanks for the feedback, @molpopgen. That's a nice clean workaround.
I'm not trying to be dense, but I don't see where a memory leak could come from in the std::array<...> example: won't the array of std::string get destroyed when the __init__ call finishes? I see potential for other undefined behavior though: if the Demo class expects the caller to keep the array of char *s alive, and accesses them later, that's undefined behavior land... a sad place to be.
it all depends on what the fate of that container is, and on how the pointers were originally create (.c_str, new, malloc from some C lib?). You can get UB like you said, or a memory leak if somewhere in the path to making the class we need deep copies, etc.. It really depends on the details. To me, a container of bare pointers is a red flag. That's all I'm saying.
To me, a container of bare pointers is a red flag. That's all I'm saying.
Gotchya. I probably reached for it because of my relative familiarity with C instead of C++... I'm afraid my code may have a distinctly C-ish smell to it :-. Hopefully getting better every day though.
Pybind doesn't try to work with C style arrays by design. You need a bit of cleverness to make it work, but it's certainly possible.
Is there still an issue here or should we close this?
The issue is that a bare char * function parameter works as explained in https://pybind11.readthedocs.io/en/stable/advanced/cast/strings.html#passing-python-strings-to-c. pybind11 automatically interprets a char * as a UTF-8-encoded string, but does not do the same with containers of char *. In my opinion, either refusing to compile a container of char * or working the same as when passing a bare char * would be good. As is, whatever method takes a container of char * ends up just getting a pointer to random junk (probably use after free, but I haven't tried with valgrind or anything).
The issue here is that the type caster for lists (to std::vector, std::array, etc) recursively calls the type caster of its nested type:
In the case of char*, the type caster stores a string object and passes a pointer to its data (const_casting the c_str()'s const away as well). So the moment this type caster for the nested char* goes out of scope, the data disappears as well.
This is definitely a bug, but I'm not immediately sure what's the best way of fixing it.