Hi all,
just updated my node to latest version (20181130). But my nodes are in bootloop now.
What happens is that after less than 60 seconds they reboot with Software or Hardware Watchdog.
I was afraid that it was due to my changes in #2057 (just merged) so I started digging into the issue:
1) I discovered that disabling the rules at boot the boot loop was not happening anymore and the nodes were stable.
2) Then I tested previous firmware versions and discovered that in 20181129 and in 20181128 the problem was already present, while in 20181127 the problem was not present.
3) last message in the log just before the WD reboot there is:
EVENT: Clock#Time=Fri,11:24
It seems that the WD reboot happens when there is the first Clock Event.
So this is why the elapsed time between boot and reboot is less than 60 seconds.
So resuming, disabling rules it works, the problem started from 20181128 and the reboot happens just after the Clock Event.
Could it be that this boot loop issue is caused by the this change done on 20181128?
[Commands] No longer run commands from scheduler
Can you post your rules?
It is possible your rules may need some extra stack allocation.
So then it is "introduced" with the fix you linked, but "caused" is not really the right description I would say.
Fact is we are very tight on stack space and running it from the rules instead of the scheduler does add some extra to the stack.
But this is more likely an indication of something we need to address in the rules processing, since it can also happen with other things we run from the rules.
1st node. Rules attached:
rules1.txt
rules2.txt
rules3.txt
rules4.txt
2nd node: does not reboot anymore but minimum Free Stack was 20 bytes:
Free Stack: | 3552 (20 - parseTemplate3)
So it is clear what is causing the crashes.
I was thinking about merging this one first and then from there proceed to look at the rules.
Do you have any idea on what rule it fails?
Everything is normal for me. Just compiled the current changes again. (2:00 pm)
My rules are:
ESPEasy is getting better and better. My Uptimes are now over a month:
Unit: | 1
Local Time: | 2018-11-30 14:17:59
Uptime: | 34 days 3 hours 29 minutes
Load: | 16.20% (LC=22118)
Free Mem: | 12504 (5824 - ruleMatch2)
Free Stack: | 3584 (424 - sendContentBlocking)
Boot: | Cold boot (0)
Reset Reason: | Software/System restart
Last weeks have been very interesting and learned a lot about the internals of ESP8266, Arduino and NonOS.
So I'm also getting more confident in causes of issues and what to do to fix them.
@TD-er
this is the piece of code that creates the crash:
On Rules#Timer=5 do
if %systm_hm%>%sunset% and %v1%=1 and [importStatus#tappSO]>70
event,scene1ON
endif
if %systm_hm%<%sunrise% and %v1%=1 and [importStatus#tappSO]>70
event,scene1ON
endif
if %systm_hm%<%sunset% and %systm_hm%>%sunrise%
event,scene1OFF
endif
endon
This piece of code calls an event (i.e. event1OFF) that calls a set of events (i.e. ReleAction=x,y,z). See full rule attached: rules1.txt
If I call the event from http (http://192.168.88.202/control?cmd=event,scene1OFF) there is no crash.
The crash happens only if I call the event1OFF event from another event (i.e. debugEvent) or through the Timer event above.
Maybe nested events are causing the crash?
I found that the rules engine does allocate a buffer on the stack to parse the lines. With this buffer still allocated, it may call other functions which call the same function again. (recursion)
So that may lead to some stack overflow after only a few calls.
can this be fixed or we should live with it (and my rules should be rewritten avoiding recursion)?
can this be fixed or we should live with it (and my rules should be rewritten avoiding recursion)?
Not sure how much extra stack we need and also how often it enters the recursion.
But you can try it yourself too.
In the function rulesProcessingFile
Change byte buf[RULES_BUFFER_SIZE]; into byte *buf = new byte[RULES_BUFFER_SIZE]();
And don't forget to add delete[] buf; right before leaving the function.
Tested. Improved but not solved.
Node 1: still boot loop
Node 2: improved from 20 to 212bytes:
Free Stack: | 3552 (212 - parseTemplate3)
Shall we apply this small change, even if it doesn't solve?
Shall I open a PR?
I am already working on a PR for these kind of allocations.
What plugin is sending the events?
Openhab
So it is a call-back function?
In the case CPLUGIN_PROTOCOL_RECV ?
Does the command from OpenHAB start with "event" ?
Maybe you can also test my last commit I made to this PR: https://github.com/letscontrolit/ESPEasy/pull/1813
No sorry. I misread your question.
The event is sent either on Clock event, or after a button press (8574) or after a Timer event.
Still would be nice if you can test the pr I linked :)
I will definitely do!
N.B. since I had to eat and bring my daughter to bed, I didn't test that code myself, so its status is that it compiles and I had thought about it while changing ;)
@TD-er
could you create a bin file for PR #1813 ?
Any specific flavour?
8266_normal_4096
thanks
your new build works!
Unit doesn't enter boot loop anymore.
But only 172 bytes free:
Free Stack: | 3600 (172 - ExecuteCommand)
I will wait another day before merging.
Maybe we can make some more optimizations and do some more testing.
For example the function parseTemplate really needs some attention.
It is doing a lot of String re-allocations which are not needed. Not sure what that will do to the stack, but at least its String argument should remain const (and no deep-copy inside the function)
By the way, I also fixed a bug in the switch plugin. The dimmer input value in the webinterface was only partly renamed, so no longer working.
I will wait another day before merging.
Maybe we can make some more optimizations and do some more testing.
For example the functionparseTemplatereally needs some attention.
It is doing a lot of String re-allocations which are not needed. Not sure what that will do to the stack, but at least its String argument should remain const (and no deep-copy inside the function)
Ok
By the way, I also fixed a bug in the switch plugin. The dimmer input value in the webinterface was only partly renamed, so no longer working.
Thanks
Thanks
It was a bug I introduced ;)
And a new test build
This one is including this commit: https://github.com/letscontrolit/ESPEasy/pull/1813/commits/aaae79c8d90fb38084da494c2495c48d608d5231
Free RAM has increased, free stack has increased and the lowest free stack value is about 400 bytes better now.
Please test this one thoroughly, since all getArgv functions have changed and also a lot of stack allocated stuff has been removed.
Hi @TD-er
with latest build free stack raised from 172 to 524.
Good work!
Also the free RAM has been increased a bit. At least for a fully populated node with "Test" build running, the amount of free RAM is again usable.
Most helpful comment
Hi @TD-er
with latest build free stack raised from 172 to 524.
Good work!