Arduino: Watchdog managed by sketch only

Created on 29 Jan 2016 · 30 Comments · Source: esp8266/Arduino

I'm wondering whether there is any way to have the watchdog managed ONLY by the user sketch.
From my point of view, this kind of "feed the dog" should be done by the user and never by any lib or SDK call (or it should be explicitly documented). But as of today I'm unable to know where and when it is fed.

Of course we can leave it as it is; I understand most users don't want to deal with it, but I'm working on sensitive production sites and I need to be able to manage it myself.
Why? Simply because if any library/SDK routine is feeding the dog in any kind of loop/for/while/delay..., it can continue to feed it while the program is blocked.
I can't assume that it will fire in all cases and cross my fingers hoping the code will do it right. Don't ask me how it can happen (bad call, bad parameters, interference, whatever...), but trust me, it happens and I have already seen it: a blocked sketch that needed a manual reset to restart. Imagine me calling a customer: "Could you please reset the device I sold you?"
And no, I don't feed the dog in my sketch or my lib, so it wasn't in my code at all.

Not sure it's possible, but knowing where it's done (lib or SDK) could help us understand and track it.
And last but not least, it may be worth adding an external WDT chip.

Any ideas are welcome.

core easy enhancement

Most helpful comment

If you do not want to hack the SDK at assembler level, it is not possible.
But you can create watchdog-like functionality for the "arduino loop".
I have this running in all my deployed ESPs to prevent a deadlock like the one you describe:

#include <Ticker.h>
Ticker tickerOSWatch;

#define OSWATCH_RESET_TIME 30  // seconds without a loop() pass before rebooting

static volatile unsigned long last_loop;  // millis() at the start of the last loop() pass

void ICACHE_RAM_ATTR osWatch(void) {
    unsigned long t = millis();
    // unsigned subtraction stays correct across millis() rollover, so abs() is not needed
    unsigned long last_run = t - last_loop;
    if(last_run >= (OSWATCH_RESET_TIME * 1000UL)) {
        // save the hit here to eeprom or to rtc memory if needed
        ESP.restart();  // normal reboot
        //ESP.reset();  // hard reset
    }
}

void setup() {
    last_loop = millis();
    // run the check three times per timeout period (every 10 s here)
    tickerOSWatch.attach_ms(((OSWATCH_RESET_TIME / 3) * 1000), osWatch);
}

void loop() {
    last_loop = millis();
    while(1)   delay(1000);  // test only: blocks loop() so the os watch triggers
}

A full lock of the ESP will be handled by the HW watchdog.
A soft lock of the SDK will trigger the SW watchdog.
A loop lock will be covered by the "OSWatch".
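
The elapsed-time check in osWatch subtracts two readings of millis(). Because both are unsigned, the difference stays correct when the counter rolls over (after roughly 49.7 days for a 32-bit millisecond counter), which is why no abs() is required around the subtraction. The following is a minimal host-side sketch of that arithmetic, not code from the ESP8266 core; the helper names elapsedMs and loopTimedOut are hypothetical.

```cpp
#include <cstdint>

// Milliseconds elapsed between two readings of a 32-bit millisecond counter.
// Unsigned subtraction wraps modulo 2^32, so the result is still correct when
// `now` has rolled over past zero and is numerically smaller than `earlier`.
constexpr uint32_t elapsedMs(uint32_t now, uint32_t earlier) {
    return now - earlier;
}

// Hypothetical helper: true once at least timeoutMs have passed since lastLoop.
constexpr bool loopTimedOut(uint32_t now, uint32_t lastLoop, uint32_t timeoutMs) {
    return elapsedMs(now, lastLoop) >= timeoutMs;
}

// Compile-time checks of the arithmetic.
static_assert(elapsedMs(5000u, 2000u) == 3000u, "no rollover");
// 0xFFFFFC18 is 1000 ms before the counter wraps; 1000 ms after the wrap,
// 2000 ms have elapsed in total.
static_assert(elapsedMs(1000u, 0xFFFFFC18u) == 2000u, "across rollover");
```

The same reasoning applies to unsigned long on the ESP8266, where millis() returns a 32-bit value.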

All 30 comments

Wohhhh excellent, I will implement this in my sketch, many thanks
and happy I'm not the only one looking for reliable and unlockable devices ;-)

+1

Maybe one could put the code into a simple library so that it can be reused more easily.

Markus, don't use it while you develop! How would you see a deadlock?

Thanks, I hope to reuse this function to prevent my devices from locking up.

@asetyde I use it only in my deployed ESPs.
For development I have another variant that only gives me warnings, and I use it to track the heap too.

Hmm, that variant would definitely be interesting!

void ICACHE_RAM_ATTR osWatch(void) {
    unsigned long t = millis();
    unsigned long last_run = abs(t - last_loop);

    os_printf("[osWatch] last_run: %d FreeRam: %d rssi: %d\n", last_run, ESP.getFreeHeap(), WiFi.RSSI());
    if(last_run >= (OSWATCH_RESET_TIME * 1000)) {
        os_printf("[osWatch] WARNING loop Blocked! check for deadlock!\n");
    }
}

Note: it can theoretically crash if the flash is blocked and the functions used are no longer in the cache.
But I have never seen this behavior; if it happens at some point, the stack will show it to me ;)

OK, good work, I'll take it. The more watchdog philosophy, the better! :)

Regarding watchdogs already present in the code... We have a hardware timer, controlled via registers, and a software watchdog serviced by one of the hardware timers dedicated to the WiFi stack. Both are normally active. The hardware timer is kicked at two points in the SDK code, and some trickery is required to override that behavior (read: replace instructions with nops). The software one is easier to get hold of: we can wrap the pp_soft_wdt_feed function at the linking step and replace it with a no-op, leaving it to the user to call the real feed function. It's not hard to implement, just a few lines of code.
The only concern is that there are a few APIs in the core and libraries which are blocking, e.g. blocking reads and writes with WiFi client and SPIFFS. These may take an unpredictable amount of time, without giving the sketch a chance to feed the watchdog. So I'm not sure that exposing the watchdog to the user and cutting it from the SDK will be immediately useful. However, comments are welcome. I would definitely love to have at least some wdt APIs exposed.

Update: @Links2004's solution gives me good stability and gets me out of some bugs and particular situations in which the ESP freezes; no freeze has been seen since.

Can this be closed? This homemade watchdog is perfect for me.

Hi guys

I'm not particularly good with the programming side of things, so this may seem really stupid, but I have an application that intermittently locks up and this seems like it might be the solution. Do I put my existing loop() code _before or after_ the line "while(1) delay(1000); // this will trigger the os watch"?

Thanks

"while(1) delay(1000); // this will trigger the os watch"
This is just test code to check whether the software watchdog triggers. You shouldn't use it in your code.

Your own code goes where this test line is...

As far as I understand, delay(x); refreshes the watchdog, so the following code (same as yours) will not trigger it

while(1) {
  delay(1000);
}

I'm using this one to trigger it.

while(1);

Hello.
I'm still uncertain about the usage of this watchdog script.

Is the code "while(1) delay(1000); // this will trigger the os watch" required?
Or is "while(1);" required?
Is it really required to add anything under void loop()?

Tx for your answers, as this gives strange behaviours and I can't really check whether it is working or not :-)

this is only test, you shouldn't use this in your code

Tx so .... to be clear this part may safely be deleted, right ?

void loop() {
last_loop = millis();
while(1) delay(1000); // this will trigger the os watch

}

And a second comment: in my case (and without entries in void loop(), see just above), this watchdog triggers every 10 seconds.

I've replaced "ESP.restart();" with "Serial.print("Watchdog Ticker");" and it constantly triggers...

That looks normal, as it fits the "(OSWATCH_RESET_TIME / 3) * 1000" interval...
So in my case it is not triggered by any monitoring; it simply triggers every 10 sec...

Any idea would be helpful.
Tx!
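
One possible reading of the "triggers every 10 seconds" report, offered as an assumption rather than a diagnosis: the ticker callback runs every 10 s regardless, and once ESP.restart() is replaced by a print nothing resets the state, so if last_loop stops being refreshed the timeout branch is taken on every tick after the first 30 s. The host-side model below uses no ESP8266 calls; simulateHits and the constants are hypothetical names chosen for illustration.

```cpp
#include <cstdint>
#include <vector>

// Host-side model of the osWatch logic; time is a simple millisecond counter.
constexpr uint32_t kResetTimeMs = 30000;               // OSWATCH_RESET_TIME * 1000
constexpr uint32_t kCheckPeriodMs = kResetTimeMs / 3;  // ticker period: 10 s

// Returns the times (in ms) at which the timeout branch is taken within
// durationMs, given that loop() refreshes last_loop every kickEveryMs.
// kickEveryMs == 0 models a loop() that is never reached or is blocked.
std::vector<uint32_t> simulateHits(uint32_t durationMs, uint32_t kickEveryMs) {
    std::vector<uint32_t> hits;
    uint32_t lastLoop = 0;
    for (uint32_t now = 1; now <= durationMs; ++now) {
        if (kickEveryMs != 0 && now % kickEveryMs == 0) {
            lastLoop = now;  // what `last_loop = millis();` does in loop()
        }
        if (now % kCheckPeriodMs == 0 && now - lastLoop >= kResetTimeMs) {
            hits.push_back(now);  // ESP.restart() or the debug print would run here
        }
    }
    return hits;
}
```

With a loop that never kicks, simulateHits(60000, 0) reports hits at 30 s, 40 s, 50 s and 60 s; with a loop that kicks every 100 ms, simulateHits(60000, 100) reports none. A message on every tick therefore suggests that loop() is not being reached, or is blocked, rather than that the ticker itself misbehaves.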

Hi,

I have been using this for some time:

declaration
/****** WATCHDOG **/
// USE last_loop=millis(); to kick the watchdog
static unsigned long last_loop;
void ICACHE_RAM_ATTR osWatch(void) {
    unsigned long t = millis();
    unsigned long last_run = t - last_loop; // difference between last recorded kick and current check
    if(last_run >= (OSWATCH_RESET_TIME * 1000)) {
        // save the hit here to eeprom or to rtc memory if needed
        ESP.restart(); // normal reboot
    }
}

in the setup
void setup() {
    last_loop = millis();
    tickerOSWatch.attach_ms(((OSWATCH_RESET_TIME / 3) * 1000), osWatch);
}

and in the loop
last_loop=millis(); //kick the watchdog

regards
Cor

Tx ! I just tested it....same result for me. It triggers every 10sec

using exactly the same code ? that would point towards something that holds
your app up too long before "kicking" the watchdog.

Indeed...I don't get it....

can you share your code ?

With pleasure :-)
Here you go !

#include <...>
#include <...>
#include <...>

#define USERNAME "yyyyyyy"
#define DEVICE_ID "xxxxxxx"
#define DEVICE_CREDENTIAL "zzzzzzz"

ThingerWifi thing(USERNAME, DEVICE_ID, DEVICE_CREDENTIAL);

#include <...>
#include <...>
#include <...>

/****** WATCHDOG **/
#define OSWATCH_RESET_TIME 30

Ticker tickerOSWatch;
// USE last_loop=millis(); to trigger the watchdog
static unsigned long last_loop;
void ICACHE_RAM_ATTR osWatch(void) {
    unsigned long t = millis();
    unsigned long last_run = abs(t - last_loop); // difference between last recorded Kick and current check
    if(last_run >= (OSWATCH_RESET_TIME * 1000)) {
        // save the hit here to eeprom or to rtc memory if needed
        //ESP.restart(); // normal reboot
        Serial.print("Watchdog");
    }
}

// Data wire is plugged into port GPIO2 = D4 on the ESP8266-12
#define ONE_WIRE_BUS 2

// Setup a oneWire instance to communicate with any OneWire devices (not just Maxim/Dallas temperature ICs)
OneWire oneWire(ONE_WIRE_BUS);

// Pass our oneWire reference to Dallas Temperature.
DallasTemperature sensors(&oneWire);

//how many clients should be able to telnet to this ESP8266
#define MAX_SRV_CLIENTS 2

const char* ssid = "zzzzz_2.4GHz";
const char* password = "fffffffffff";

WiFiServer server(11756);
WiFiClient serverClients[MAX_SRV_CLIENTS];

void setup() {

    //watchdog (Ticker)
    last_loop = millis();
    tickerOSWatch.attach_ms(((OSWATCH_RESET_TIME / 3) * 1000), osWatch);

    //start UART and the server
    Serial.begin(9600);
    Serial.swap();
    server.begin();
    server.setNoDelay(true);
    IPAddress ESP8266_ip ( 192, 168, 2, 155);
    IPAddress dns_ip ( 192, 168, 2, 1);
    IPAddress gateway_ip ( 192, 168, 2, 1);
    IPAddress subnet_mask(255, 255, 255, 0);

    WiFi.config(ESP8266_ip, gateway_ip, subnet_mask);
    //WiFi.config(ESP8266_ip, gateway_ip, subnet_mask, dns_ip);

    WiFi.mode(WIFI_STA);
    WiFi.begin(ssid, password);
    //Serial.print("\nConnecting to "); Serial.println(ssid);
    uint8_t i = 0;
    while (WiFi.status() != WL_CONNECTED && i++ < 20) delay(100);
    if(i == 21){
        //Serial.print("Could not connect to"); Serial.println(ssid);
        while(1) delay(500);
    }

    //added by Phil*DEBUT*Thinger Tension********
    // list all the different sensors with "thing"
    {
        pinMode(A0, INPUT);

        // resource Tension
        thing["Tension"] >> [](pson & out){
            out = (unsigned int) analogRead(A0) * 0.0241796875 * 1.0169;
        };

        //-----------------Thinger Temperature------------------------
        thing["Temperature"] >> [](pson & out) {
            sensors.requestTemperaturesByIndex(0);
            out = sensors.getTempCByIndex(0);

            //sensors.requestTemperatures(); // Send the command to get temperatures
            //out["Temperature"] = sensors.getTempCByIndex(0);
        };
        sensors.begin();

        //-------------------------------------
    }

    //added by Phil*****FIN*******
}

void loop() {

    last_loop=millis(); //kick the watchdog

    uint8_t i;
    //check if there are any new clients
    if (server.hasClient()){
        for(i = 0; i < MAX_SRV_CLIENTS; i++){
            //find free/disconnected spot
            if (!serverClients[i] || !serverClients[i].connected()){
                if(serverClients[i]) serverClients[i].stop();
                serverClients[i] = server.available();
                // Serial.print("New client: "); Serial.print(i);
                continue;
            }
        }
        //no free/disconnected spot so reject
        WiFiClient serverClient = server.available();
        serverClient.stop();
    }
    //check clients for data
    for(i = 0; i < MAX_SRV_CLIENTS; i++){
        if (serverClients[i] && serverClients[i].connected()){
            if(serverClients[i].available()){
                //get data from the telnet client and push it to the UART
                while(serverClients[i].available()) Serial.write(serverClients[i].read());
            }
        }
    }
    //check UART for data
    if(Serial.available()){
        size_t len = Serial.available();
        uint8_t sbuf[len];
        Serial.readBytes(sbuf, len);
        //push UART data to all connected telnet clients
        for(i = 0; i < MAX_SRV_CLIENTS; i++){
            if (serverClients[i] && serverClients[i].connected()){
                serverClients[i].write(sbuf, len);
                delay(100);
            }
        }
    }

    //added by Phil********************************************
    {
        thing.handle();
    }
    //added by Phil**************
}

It's mainly the IDE proposed Serial2WIFI sketch integrated with Thinger.
I've added the watchdog script...

PS: and the reason is that the Serial2WIFI sketch seems to hang after a few hours

Hi,

One of the things that might happen:

Case 1:
//check clients for data
for(i = 0; i < MAX_SRV_CLIENTS; i++){
if (serverClients[i] && serverClients[i].connected()){
if(serverClients[i].available()){
//get data from the telnet client and push it to the UART
while(serverClients[i].available()) Serial.write(serverClients[i].read());
}
}
}

Remark: if the client sends its data and you read it, could that take more than 10 seconds in total? If so, it will trigger the watchdog. So for routines that can last longer than your interval, you also need to kick the watchdog inside the routine by adding a last_loop=millis();

Case 2:
//check UART for data
if(Serial.available()){
size_t len = Serial.available();
uint8_t sbuf[len];
Serial.readBytes(sbuf, len);
//push UART data to all connected telnet clients
for(i = 0; i < MAX_SRV_CLIENTS; i++){
if (serverClients[i] && serverClients[i].connected()){
serverClients[i].write(sbuf, len);
delay(100);
}
Remark: as above, in this case there is even more chance of a long processing time. First you need to wait until all the bytes have come in over serial, and then you push them out to the clients. Could it take more than 10 seconds?
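
The advice in the two remarks above, kicking the watchdog from inside a routine that may legitimately run for a long time, can be illustrated with a small host-side model. This is only a sketch under stated assumptions: LoopWatch and processAll are hypothetical names, time is simulated with a counter instead of millis(), and the 30 s timeout mirrors OSWATCH_RESET_TIME.

```cpp
#include <cstdint>

// Minimal model of the last_loop / osWatch pair.
struct LoopWatch {
    uint32_t lastKick = 0;
    uint32_t timeoutMs;
    explicit LoopWatch(uint32_t t) : timeoutMs(t) {}
    void kick(uint32_t now) { lastKick = now; }  // last_loop = millis();
    bool expired(uint32_t now) const { return now - lastKick >= timeoutMs; }
};

// Simulates a routine handling `items` work items that cost `costMs` each.
// Returns true if the watch would have expired at any point during the run.
bool processAll(LoopWatch& watch, uint32_t& clockMs, int items,
                uint32_t costMs, bool kickInside) {
    bool expired = false;
    for (int i = 0; i < items; ++i) {
        clockMs += costMs;                       // simulated time spent on one item
        if (kickInside) watch.kick(clockMs);     // kick after each finished item
        if (watch.expired(clockMs)) expired = true;
    }
    return expired;
}
```

With 40 items of 1 s each and a 30 s timeout, the routine trips the watch unless it kicks inside the loop. The trade-off is the one raised at the top of this thread: a kick placed inside a loop keeps the watch quiet even if that loop never ends, so it seems safer to kick only after a unit of work has actually completed.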

Tx for your time and attention !
Regarding your "Case 1" remark, yes it could. I have a logger constantly sending requests and collecting answers at a rate of 2/sec.
The reason why I'm trying to include a watchdog is that this seems to halt/block/freeze after a few hours.

Hello,

First I'm not sure it is the right place to post my questions...but anyway (CorBer, is there a way to communicate directly ?).

Now I've included the code as follows
`void loop() {

uint8_t i;
//check if there are any new clients
if (server.hasClient()){
last_loop=millis(); //kick the watchdog
for(i = 0; i < MAX_SRV_CLIENTS; i++){
//find free/disconnected spot
if (!serverClients[i] || !serverClients[i].connected()){
if(serverClients[i]) serverClients[i].stop();
serverClients[i] = server.available();
// Serial.print("New client: "); Serial.print(i);
continue;
}
}
//no free/disconnected spot so reject
WiFiClient serverClient = server.available();
serverClient.stop();`

I monitor my serial to ensure I don't receive anything... all empty.
Then I send one string to the serial and count... the watchdog triggers 10 sec after the start of the loop, not 10 sec after my last sent command.

In fact I'm now wondering whether the watchdog approach is the right one.
The issue I'm trying to solve is that my TCP WiFi server seems to hang after a while and I don't know why.
Wouldn't it be better to check for data from the "telnet" clients and, if none has been received in the last 10 minutes, restart the server?
How could I do this?
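
One way the inactivity check asked about above could be structured is a small timer object that records the time of the last received data and reports when too long has passed. This is a hedged sketch, not something proposed elsewhere in the thread: InactivityTimer is a hypothetical class, and the caller is assumed to pass in the current time (millis() on the device).

```cpp
#include <cstdint>

// Tracks the time of the last observed activity and reports when no
// activity has been seen for at least timeoutMs.
class InactivityTimer {
public:
    explicit InactivityTimer(uint32_t timeoutMs) : timeoutMs_(timeoutMs) {}

    // Call whenever data arrives, e.g. when a client's available() is non-zero.
    void touch(uint32_t nowMs) { lastActivityMs_ = nowMs; }

    // Unsigned subtraction keeps this correct across counter rollover.
    bool expired(uint32_t nowMs) const {
        return nowMs - lastActivityMs_ >= timeoutMs_;
    }

private:
    uint32_t timeoutMs_;
    uint32_t lastActivityMs_ = 0;
};
```

In the sketch, touch(millis()) would go next to the code that reads from serverClients[i], and loop() would check expired(millis()) and then restart the server or call ESP.restart(). Whether restarting only the server is enough to recover from the hang is not established in this thread.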

Hi,

You can contact me via registax at gmail dot com

Reverse engineering of the ESP8266 watchdog timer

Thought this might be helpful!

https://mongoose-os.com/blog/esp8266-watchdog-timer/
