Calling the mesh.connect() method causes Mutex 0x2000157c error -6: Not allowed in ISR context.
I'm working on porting mbed-os and getting the network stack working on EFR32FG devices, based on the existing EFR32MG support, but having some troubles with ISRs trying to use the platform critical section mutex in what appears to be the precompiled network stack.
Target
EFR32FG12
Toolchain:
GCC_ARM
Toolchain version:
arm-none-eabi-gcc (GNU Tools for ARM Embedded Processors 6-2017-q2-update) 6.3.1 20170620 (release) [ARM/embedded-6-branch revision 249437]
mbed-cli version:
1.2.0
mbed-os version:
mbed-os-5.4.0-rc1-2137-g9b01a67
Expected behavior
Building with nanostack should hopefully work.
Actual behavior
Error in mutex access from ISR in NanostackRfPhyEfr32.cpp caused by underlying call to platform_enter_critical embedded in libnanostack.a.
Causes error output: Mutex 0x2000157c error -6: Not allowed in ISR context when compiled in debug mode, and nasty startup failures if not.
Steps to reproduce
Build and run ryankurte/node-mbed on an EFR32FG12 target (make f to use JLink adaptor as board does not have mbed file system support).
Optionally add a breakpoint instruction __asm__("#BKPT 01") here to assist with catching the error.
Run debug server and gdb with make ds and make d.
Building with the debug profile (mbed-cli compile --profile mbed-os/tools/profiles/debug.json)
Main looks like:
LoWPANNDInterface mesh;
NanostackRfPhyEfr32 rf_phy;
int main(int argc, char **argv) {
int res;
printf("\nInitialising mbed-os\n");
res = mesh.initialize(&rf_phy);
if (res < 0) {
printf("Mesh error: %d\n", res);
led2 = 1;
}
printf("Mesh initialised, connecting...\n");
res = mesh.connect();
if (res < 0) {
printf("Connect error: %d\n", res);
led2 = 1;
}
printf("Awaiting connection...\n");
while (1) {
led1 = !led1;
wait(0.5);
}
}
Which gives the output:
Initialising mbed-os
Mesh initialised, connecting...
Mutex 0x2000157c error -6: Not allowed in ISR context
Dropping a n __asm__("BKPT #X); into error_msg in rtx_handlers.c to catch that error gives me the following stack trace.
``` │
│73 static const char* error_msg(int32_t status) │
│74 { │
│75 __asm__("BKPT #9"); │
│76 switch (status) {
at ./mbed-os/features/nanostack/FEATURE_NANOSTACK/targets/TARGET_SL_RAIL/NanostackRfPhyEfr32.cpp:706
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb)
```
It appears to me that the EFR32 nanostack PHY TX complete callback is calling the nanostack virtual mac, which is then calling the underlying platform_enter_critical function, which attempts to grab the critical section mutex and fails because it's from an ISR. Which afaik seems like a sensible thing to do with a tx complete interrupt.
As far as I can tell, the issue is kindof a duplicate of mbed-os-example-client#284 mbed-os-example-client#303, #4834 and #4904 but as applied to libnanostack which as a closed source blob we cannot reasonable fix :-/
Any ideas on how to solve this? I am working on an update to 5.5, but it appears this issue exists across mbed-os versions.
One of the other issues drops the mutex attempt from the driver, except it's not the driver layer that is attempting it in this case.
Other option I have come up with is to implement a radio thread to wrap the interrupt based phy, but it seems like a bit of a bodge.
Also as a sidenote, this error appearing as a hardfault / causing startup failure in non-debug builds is (imo) very difficult to debug, it might be interesting to wrap this error in a more useful manner for other build types as well as debug.
Thanks,
Ryan
On further investigation it seems that the NCS36510 driver avoids this by running a thread to delegate calls to the underlying driver, which is a little more difficult (lots of data to communicate back from interrupts) but probably achievable. It might also be interesting to make the mac callbacks thread / interrupt safe to avoid the need for this.
Implementing the bare minimum to continue the choose your own adventure nets me:
#0 mesh_system_heap_error_handler (event=NS_DYN_MEM_ALLOCATE_SIZE_NOT_VALID)
at ./mbed-os/features/nanostack/FEATURE_NANOSTACK/mbed-mesh-api/source/mesh_system.c:58
#1 0x00002ff4 in heap_failure (book=0x20003084 <app_stack_heap>, reason=NS_DYN_MEM_ALLOCATE_SIZE_NOT_VALID)
at ./mbed-os/features/FEATURE_COMMON_PAL/nanostack-libservice/source/nsdynmemLIB/nsdynmemLIB.c:64
#2 0x00003212 in convert_allocation_size (book=0x20003084 <app_stack_heap>, requested_bytes=65462)
at ./mbed-os/features/FEATURE_COMMON_PAL/nanostack-libservice/source/nsdynmemLIB/nsdynmemLIB.c:174
#3 0x0000329a in ns_mem_internal_alloc (book=0x20003084 <app_stack_heap>, alloc_size=65462, direction=-1)
at ./mbed-os/features/FEATURE_COMMON_PAL/nanostack-libservice/source/nsdynmemLIB/nsdynmemLIB.c:210
#4 0x00003490 in ns_mem_alloc (heap=0x20003084 <app_stack_heap>, alloc_size=65462)
at ./mbed-os/features/FEATURE_COMMON_PAL/nanostack-libservice/source/nsdynmemLIB/nsdynmemLIB.c:300
#5 0x000034da in ns_dyn_mem_alloc (alloc_size=65462)
at ./mbed-os/features/FEATURE_COMMON_PAL/nanostack-libservice/source/nsdynmemLIB/nsdynmemLIB.c:310
#6 0x00034d96 in mac_mlme_data_base_allocate ()
#7 0x0001b3ea in ns_sw_mac_create ()
#8 0x00006b7c in nd_tasklet_network_init (device_id=0 '\000')
at ./mbed-os/features/nanostack/FEATURE_NANOSTACK/mbed-mesh-api/source/nd_tasklet.c:431
#9 0x00005f68 in LoWPANNDInterface::init (this=0x20007954 <mesh>)
at ./mbed-os/features/nanostack/FEATURE_NANOSTACK/mbed-mesh-api/source/LoWPANNDInterface.cpp:67
#10 0x00005ea6 in LoWPANNDInterface::connect (this=0x20007954 <mesh>)
at ./mbed-os/features/nanostack/FEATURE_NANOSTACK/mbed-mesh-api/source/LoWPANNDInterface.cpp:27
#11 0x000156aa in main (argc=0, argv=0x0 <osRegisterForOsEvents>) at ./source/main.cpp:34
(gdb)
Which looks a tad like the stack is trying to allocate 65K to ndp, off to work out how / where to configure that.
cc @SeppoTakalo @stevew817 @akselsm
Seems as if in both this issue and in issue #303 for mbed-os-example-client the problem is due to calling platform_enter_critical() in interrupt context.
Unfortunately, the two call stacks which lead to calling platform_enter_critical() in interrupt context are completely different (seems even that they have nothing in common until platform_enter_critical() ) . The call stack trace for issue #303 can be found here.
So as a follow up to that issue, have poorly hacked the driver to pass signals back to a thread and juggled stack sizes and we're all working.
@betzw it just comes back to not being able to use mutexes in ISRs I guess, as a function of mbed-os not supporting nested interrupts, which is fine if documented / not needed / not used.
Needs to be polished / PR'd, but the solution is in sight. It would be super cool if nanostack was ok with being called from an ISR, and also if it didn't crash horribly if you do do it in any mode but debug, but we're on the way up ^_^
Nanostack cannot be OK when called from ISR context as it used mutexes for protection.
We don't have interrupt safe mutexes in Mbed OS.
Requirement for using the worker thread is documented in Nanostack's driver API https://docs.mbed.com/docs/arm-ipv66lowpan-stack/en/latest/driver_api/
A fix for this is in the works. I'll PR it when ready.
Getting same error on NUCLEOF767ZI
@danielklioc Please open a new ticket for NUCLEO since this is a platform-specific issue. NUCLEOF767ZI doesn't contain a radio, so it is highly unlikely it is the same root cause.