Describe the bug
The getaddrinfo (zsock_getaddrinfo) method fail with DNS_EAI_MEMORY error at calloc.
from subsys/net/lib/sockets/getaddrinfo.c:
int zsock_getaddrinfo(const char *host, const char *service,
const struct zsock_addrinfo *hints,
struct zsock_addrinfo **res)
{
if (IS_ENABLED(CONFIG_NET_SOCKETS_OFFLOAD)) {
return socket_offload_getaddrinfo(host, service, hints, res);
}
#if defined(CONFIG_DNS_RESOLVER)
int ret;
*res = calloc(AI_ARR_MAX, sizeof(struct zsock_addrinfo));
if (!(*res)) {
return DNS_EAI_MEMORY;
}
ret = z_zsock_getaddrinfo_internal(host, service, hints, *res);
if (ret) {
free(*res);
*res = NULL;
}
return ret;
#endif
return DNS_EAI_FAIL;
}
To Reproduce
Can look at https://github.com/nandojve/zephyr/tree/topic/tagoio_http_post_json project. It is an functional test with TagoIO IoT Platform using HTTP post.
The implementation is in int tagoio_connect(struct tagoio_context_t *ctx) method on socket.c file.
Expected behavior
Proper getaddrinfo function working independent of libC.
Impact/workaround
Don't work without add newlib C.
Environment:
Additional context
I noted that DNS on shell works independent of the lib C used.
Looks like this might depend on CONFIG_MINIMAL_LIBC_MALLOC_ARENA_SIZE and CONFIG_MINIMAL_LIBC_CALLOC for the minimal libc. I don't see a similar listing for newlib libc heap size.
@nandojve - if you add those configuration options (e.g. set arena size to 16384 and calloc to y) does it fix your issues?
If so, I would suggest that we add a dependency on those two in Kconfig for CONFIG_DNS_RESOLVER.
Thoughts, @pfalcon ?
If so, I would suggest that we add a dependency on those two in Kconfig for CONFIG_DNS_RESOLVER.
Thoughts, @pfalcon ?
I'd suggest to not mix up CONFIG_DNS_RESOLVER and POSIX getaddrinfo(), whose implementation may malloc() something. I would need to look in more detail for more specific suggestions.
Hi @cfriedt - yes, it worked thanks for the tip.
However after eleven minutes start to got same erro using 16384 for arena.
[00:11:18.243,000] <err> tagoio: Could not resolve dns, error: -10
Probably is because not all paths after _z_zsock_getaddrinfo_internal_ free the resource.
Yep, I forget to use _freeaddrinfo_.
I noted that MINIMAL_LIBC_MALLOC default is YES and MINIMAL_LIBC_MALLOC_ARENA_SIZE default is 0 so it will always fail for anyone. Do you know if there is any particular reason to keep default 0? I'm not understand the purpose to enable malloc by default without valid arena.
May HEAP_MEM_POOL_SIZE be the right value for MINIMAL_LIBC_MALLOC_ARENA_SIZE?
@pfalcon - given the above, I'm hoping that this one-liner could fix it. I'm not convinced that HEAP_MEM_POOL is the right config option to use, but I could be wrong.
diff --git a/subsys/net/lib/dns/Kconfig b/subsys/net/lib/dns/Kconfig
index 99430d58d4..3954a5b50d 100644
--- a/subsys/net/lib/dns/Kconfig
+++ b/subsys/net/lib/dns/Kconfig
@@ -3,6 +3,7 @@
config DNS_RESOLVER
bool "DNS resolver"
+ depends on NEWLIB_LIBC || (MINIMAL_LIBC && MINIMAL_LIBC_MALLOC_ARENA_SIZE > 0)
help
This option enables the DNS client side support for Zephyr
@nandojve:
To Reproduce
Can look at https://github.com/nandojve/zephyr/tree/topic/tagoio_http_post_json project.
Please try to provide: either a) (recommended) reproduction case within bounds of the main Zephyr tree (i.e., take an existing example/test, provide small patch which exposes the issue); or b) if that's not possible, minimal self-contained code which just shows the issue. Thanks.
@cfriedt:
@pfalcon - given the above ...
My response remains the same - the issue identified (getaddrinfo() implementation calling calloc()) has nothing to do with CONFIG_DNS_RESOLVER (nowhere does subsys/net/lib/dns/* call malloc() or calloc()). Consequently, patching CONFIG_DNS_RESOLVER is not an acceptable solution.
Let's now think what could be:
One obvious solution is not to use dynamic allocation. But the usual suspect is involved: a74137f665d502e8eab1278315f975133dcbfe59. So, as one option for solution you may want to submit a revert for that commit.
Other option would involve invoking your idea of what in libc should work out of the box. And while we disagree with you re: whether scanf() should work OOB (vs be enabled separately), I may assure you we'd agree that malloc() should work OOB. But let's look what we have in the case of minlibc:
~
config MINIMAL_LIBC_MALLOC
default y
config MINIMAL_LIBC_MALLOC_ARENA_SIZE
default 0
depends on MINIMAL_LIBC_MALLOC
~
I.e., with minlibc, malloc() is enabled by default, but it's enabled for immediate failure, as there's nothing to alloc from. This totally makes no sense. malloc() should work out of the box, at least for minimal alloc sizes. People who don't want malloc overhead, should disable it explicitly, people who want beyond-minimal alloc sizes, should configure them explicitly. So, if you submit a patch like:
~
config MINIMAL_LIBC_MALLOC_ARENA_SIZE
int "Size of the minimal libc malloc arena"
- default 0
+ default 256
~
, I'll approve it.
Hmm... I see your point. Malloc isn't strictly required for getaddrinfo to work. Maybe in this case it should just be left to the application's configuration.
Malloc isn't strictly required for getaddrinfo to work.
Malloc isn't required for native Zephyr DNS_RESOLVER. After the commit quoted above, https://github.com/zephyrproject-rtos/zephyr/commit/a74137f665d502e8eab1278315f975133dcbfe59, malloc is required for getaddrinfo() to work (even though initially it was purposely written to not depend on malloc).
Technically malloc isn't required for getaddrinfo if CONFIG_NET_SOCKETS_OFFLOAD=y. I get that it was originally meant to work without malloc though.
I'll whip together a patch for option number 2. @nandojve - does that work for you? It means that malloc works OOTB rather than fail OOTB. Also means that malloc will need to be explicitly disabled or the arena size must be set the size to 0 if they want to save those 256 bytes.
I noted that MINIMAL_LIBC_MALLOC default is YES and MINIMAL_LIBC_MALLOC_ARENA_SIZE default is 0 so it will always fail for anyone. Do you know if there is any particular reason to keep default 0? I'm not understand the purpose to enable malloc by default without valid arena.
Because the way malloc() is specifically implemented in minimal libc, the RAM for the malloc arena has to be reserved in advance. Any value of CONFIG_MINIMAL_LIBC_MALLOC_ARENA_SIZE is directly added to the RAM footprint, which is why this defaults to 0.
But the usual suspect is involved: a74137f.
@pfalcon passive aggressive stuff like this is not helping your cause. Especially since I made the change because your original code was not thread-safe, and the spec LITERALLY says an allocation is made. I know you don't like reading specs (witness the recent nonsense with ioctl() and how many args it can take) but you can't just ignore thread safety when working on an operating system.
@nandojve - I think the current resolution to this is that we are considering it an application config issue, so the owness is put onto the user to set CONFIG_MINIMAL_LIBC_MALLOC_ARENA_SIZE.
I put together a change in #26951 that will hopefully make it more obvious to users.
I hope that helps to resolve your issue.
I'll close this once #26951 is merged.
@cfriedt @pfalcon thank you for address this issue.
While there was some great work on closing unfixed issues over weekend, I'm no so fast, so please let me go over over this ticket, #26932, and #26951, and reply to some comments.
@cfriedt, from https://github.com/zephyrproject-rtos/zephyr/pull/26932#issuecomment-660384628
Would anyone (other than me) like to suggest that the bug reported in #26720 is an application configuration issue then?
Well, seeing issues like this, a suggestion I have in my head is "if you use POSIX functionality, then you should use the most full C library, which is Newlib, or otherwise you will be hit by surprises again and again". I don't spell out such a suggestion, because I don't see how it's helpful to users: It's clear that in many cases RAM/ROM size is a constraint, and it's very understandable that users want to use smaller C library. So, for as long as issues in minimal libc are clearly identifiable and resolvable, I'm all for resolving them and developing minlibc further (while keeping in mind that it should stay small and simple, or otherwise, again, Newlib should be used right away).
@andrewboie, from https://github.com/zephyrproject-rtos/zephyr/pull/26932#issuecomment-660352624
Now I remember this. I changed it to perform an allocation because that's what the spec states:
The getaddrinfo() function allocates and initializes a linked list of addrinfo structures, one for each network address that matches node and service, subject to any restrictions imposed by hints, and returns a pointer to the start of the list in res. The items in the linked list are linked by the ai_next field.
I read it again and again, and don't see where it says "getaddrinfo() calls malloc()". Otherwise, as you may imagine, static allocation perfectly fits the verb "allocates". You're absolutely right that it's not thread-safe, that's why I don't seriously suggest to revert that commit. On the other hand, I don't remember seeing a bug report of users hitting that issue. So, your heroic effort of fixing it is somewhat misplaced, because the whole socket subsystem is not thread safe. That's because its initial goal was simplicity and small resource usage, it was literally a proof of concept that we can use standards-based API on top of existing adhoc Zephyr API without significant overhead imposed on applications. That worked, and Sockets API was even made the only Zephyr's networking API. Its suggested mode of operation is that a single thread performs all networking activity, while other threads handle sensors, actuators, etc. It definitely can grow from there to add more overheads, like full thread-safety, but that rather be done in planned, concerted way.
@pfalcon passive aggressive stuff like this is not helping your cause.
Well, my cause is helping to promote Zephyr usage, by identifying missing parts and issues, and contributing towards their resolution to user satisfaction. And my cause as hard as users' usage of Zephyr, and that's indeed hard, because its full of random gaps and issues.
Let's take a case here: you identified an issue (lack of thread safety) and fixed it. But your fix opened up another issue, of how well it plays with minlibc. Well, that happens all the time, and quite normal. We've got a user bug report, and your colleagues triaged it, considered different ways to resolve it, chose one which maximizes benefits (getaddrinfo() works OOB, small users malloc() allocations work too, as users who'd use it would expect). But suddenly, you're blocking that solution, and effectively any proper resolution to the issue at all, using arguments like "RAM size increased by an arbitrary choice of 256 bytes". You see, any functionality requires some overhead. And the default state of the system should be that its configuration is minimal and all baseline functionality works (for example, malloc(), for as long as its enabled by default (which it is) should work). And instead, people who shoot for absolutely minimal resource usage should go over their configs and disable anything they don't use (because Zephyr default config isn't optimized for that anyway).
And that's not the only case when you behave like that - you block other incremental improvements regularly, and generally behave as if you don't make any mistakes (but e.g. this ticket is all about your mistake of not testing getaddrinfo() with minlibc after your changes). When consulted regarding areas you fully own, you behave as if your scribblings containing inconsistencies and lacking details, is a kind of scripture, and refuse to answer further questions or process suggestions. So, you should consider who's passive aggressive here.
CC @otavio
read it again and again, and don't see where it says "getaddrinfo() calls malloc()".
If you're going to call this a getaddrinfo() implementation, you are absolutely free to implement another solution that meets the constraint, from the spec, that "The freeaddrinfo() and getaddrinfo() functions shall be thread-safe." https://pubs.opengroup.org/onlinepubs/009695399/functions/freeaddrinfo.html
If not the heap where else would one obtain this memory? It can't be static data due to thread safety, and the API returns a pointer from somewhere not provided by the caller. I can't believe you're arguing this point. Change it however you like as long as the spec is met.
So, your heroic effort of fixing it is somewhat misplaced, because the whole socket subsystem is not thread safe. That's because its initial goal was simplicity and small resource usage, it was literally a proof of concept that we can use standards-based API on top of existing adhoc Zephyr API without significant overhead imposed on applications
Astonishing. Only in your world is thread safety optional when developing OS features. This is a serious bug and a deviation from the spec, not an enhancement to be made later (or lets face it, never at all, given that these socket APIs have been in the kernel for years). Did you know that people are using this "proof on concept" in production?
What's sad about all this is implementing thread safety is not hard. It does not impose undue overhead to instantiate and use locks in the proper places. It just requires careful thought on what the defined scope of the locks are and where to place the critical sections.
you block other incremental improvements regularly
refuse to answer further questions or process suggestions.
This is just you. I came to the conclusion some time ago that it simply not worth my time to deal with you unless you are about to commit something misinformed and/or crazy to the kernel. Pretty much all code you submit to this project has not taken into consideration thread safety, security, whether it actually implements the spec it purports to implement. You are amazingly ill-informed on topics that an OS developer should know, and weigh in forcefully all the time when you don't have idea of what you're talking about. Whine about it all you want, I seriously don't care.