Supervisor: feature request: delay restart before status FATAL

Created on 5 Sep 2014 · 96 Comments · Source: Supervisor/supervisor

Use case:

During a software update, the server can't restart because some required Python modules are not available yet. Supervisor retries N times (N comes from the config; AFAIK it defaults to 3).

After N failures, the process enters the FATAL state.

In my use case the program entered the FATAL state although the restart would have succeeded just a few seconds later.

Can you understand this use case?

A possible solution would be a delay in seconds: after N failed retries, wait M seconds and then try again. In my case this should repeat indefinitely (endless loop).

guettli

👍86

All 96 comments

:+1:

A delaybetweenretries parameter would be useful (bonus points for exponential backoff)

mgalgs on 5 Dec 2014

👍26 ❤2

:+1: this would be really useful.

Allowing us to configure how long to wait between retries would be great. Allowing for a backoff would also help.

I think it would be great if we could also specify a retry count of infinity (never give up) in combination with a sane value for delay_between_retries.

aztlan2k on 19 Jan 2015

👍2

+1, would be useful.

mmh on 27 Jan 2015

:+1: would be great. There's already a PR with that feature https://github.com/Supervisor/supervisor/pull/509, but it's not automergeable, and there's no discussion on it.

willybarro on 11 Feb 2015

mrook on 10 Mar 2015

nicwest on 18 Mar 2015

arinto on 30 Mar 2015

+1 This would be very useful for me

nereusz on 22 Apr 2015

CheatCodes on 22 Apr 2015

ghost on 21 May 2015

jsmirnov on 23 May 2015

Does someone have enough knowledge to create a patch for this?

guettli on 23 May 2015

I would love to see this too, so :+1:

dbpolito on 29 May 2015

detailyang on 1 Jun 2015

:+1:

oryband on 8 Jun 2015

Would love to have this feature. :+1:

vincent-io on 12 Jun 2015

👍

siavashs on 12 Jul 2015

Why was this ticket closed? Was the feature request implemented? I can't see a code change on this issue or on #561.

guettli on 13 Jul 2015

👍1

tonicospinelli on 13 Jul 2015

dear sir plz implement this alrdy k thx bai

oryband on 14 Jul 2015

:thumbsup: would like this feature, thanks!

eggsby on 4 Aug 2015

+1, would be useful for me too!

0x20h on 4 Aug 2015

+1, hope this!

caorong on 18 Aug 2015

conanfanli on 28 Aug 2015

dieend on 31 Aug 2015

+1 for delay_between_retries and also exponential backoff would be great.

miso-belica on 1 Sep 2015

pfuender on 16 Sep 2015

toastbrotch on 16 Sep 2015

Let's try with PR #659. It's based on fclairamb's work and includes fixes for the tests.

pfuender on 16 Sep 2015

philpearl on 16 Nov 2015

It is quite useful! Can someone help with that?

SilverBut on 21 Dec 2015

ikb42 on 23 Dec 2015

Doca on 21 Jan 2016

kyle-long on 22 Jan 2016

michal-organek on 10 Feb 2016

draganHR on 12 Feb 2016

ummae on 15 Feb 2016

delay_between_retries would be great

igama on 17 Feb 2016

dmarcantonio on 22 Mar 2016

The documentation says:

When an autorestarting process is in the BACKOFF state, it will be automatically restarted by supervisord. It will switch between STARTING and BACKOFF states until it becomes evident that it cannot be started because the number of startretries has exceeded the maximum, at which point it will transition to the FATAL state. Each start retry will take progressively more time.

Does that mean that this behavior is implemented but not made available? I am confused...

vlsd on 1 Apr 2016

👍5

igtw on 10 Jun 2016

Having this same issue using supervisor to run gogs on kubernetes, I need gogs to wait for mysql to init before crapping out

2016/08/18 21:20:33 [install.go:65 GlobalInit()] [E] Fail to initialize ORM engine: migrate: sync: dial tcp 127.0.0.1:3306: connection refused
2016-08-18 21:20:33,474 INFO exited: gogs (exit status 1; not expected)
2016-08-18 21:20:34,475 INFO gave up: gogs entered FATAL state, too many start retries too quickly

jonathan-kosgei on 19 Aug 2016

mirfilip on 28 Aug 2016

guilhermeadc on 5 Sep 2016

ivancli on 15 Sep 2016

Hacky implementation of this: sleep for x number of seconds after running your command. e.g.

[program:www]
command=bash -c "/var/www/bin/server.sh; sleep 3"

cabloo on 19 Sep 2016

👍12

Napas on 4 Nov 2016

glagola on 10 Nov 2016

johnmarcou on 29 Dec 2016

gonesurfing on 6 Jan 2017

sasounda on 12 Jan 2017

rgrcnh on 19 Jan 2017

👍

csgyuricza on 29 Jan 2017

The "sleep" alternative proposed by @cabloo won't work because supervisord does not support multiple commands or quotes - not sure why:

command=bash -c "java -version ; sleep 5"

will cause

can't parse command 'bash -c "java -version': No closing quotation

I'm running supervisor 3.3.1

csgyuricza on 30 Jan 2017

👍1

@csgyuricza looks like it's parsing the ; wrong (cutting off at that point). I would try removing the space before it.

cabloo on 30 Jan 2017

@cabloo I tried without spaces as well; doesn't work. It's really annoying that supervisor doesn't allow you to concatenate 2 or more commands together

csgyuricza on 30 Jan 2017

@csgyuricza maybe try || instead of ; (which will only sleep if the command fails)

cabloo on 30 Jan 2017

👍1

@csgyuricza make sure you edit directory=/bin/ as well, and then make the path in -c "/absolute/path/to/script.sh; sleep 5s" - absolute ;-)

Fired up just fine with me!

ssmulders on 31 Jan 2017

I just use sh -c "/mycommand || (sleep 5s && false)" to:

sleep only when process failed
exit with a >0 status code and let supervisor handle the error
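
For anyone who would rather not depend on shell quoting inside command=, here is a rough Python sketch of the same idea: run the real command, sleep only when it fails, and still exit non-zero so supervisor handles the error. The function name, the delay_seconds parameter, and the delay_wrapper.py file name are made up for illustration; nothing here is part of supervisor.

```python
import subprocess
import sys
import time


def run_with_delay_on_failure(argv, delay_seconds=5.0, sleep=time.sleep):
    """Run argv; if it fails, sleep before returning a non-zero exit code."""
    returncode = subprocess.call(argv)
    if returncode == 0:
        return 0
    # Pause only on failure so that supervisor's restart is delayed,
    # then report failure so supervisor still sees an unexpected exit.
    sleep(delay_seconds)
    return 1


# Hypothetical usage: python delay_wrapper.py /mycommand --some-arg
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(run_with_delay_on_failure(sys.argv[1:]))
```

As with the sh -c variant, supervisor supervises the wrapper rather than the real command, so the wrapper shows up as an extra process and stop signals go to it first.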

jderusse on 7 Apr 2017

👍11

still waiting on this! +1

prafed on 9 May 2017

davefarthing on 15 May 2017

lucas-dall on 15 May 2017

dream91 on 22 May 2017

koprivajakub on 23 May 2017

phaibin on 6 Jun 2017

The BACKOFF state does apparently increase the delay between restart attempts. The docs don't currently explain this. c.f. https://github.com/Supervisor/supervisor/pull/659

Perhaps I'll find the time to submit a pull request for the docs.

jimbrowne on 16 Jun 2017

👍3

gomgomgom on 3 Jul 2017

virusdefender on 21 Jul 2017

bryantebeek on 21 Jul 2017

This is really needed for a process to auto-recover from a network issue or after an external system has recovered.

yixiaol-m on 4 Aug 2017

wardlawp on 23 Aug 2017

garyelephant on 29 Aug 2017

Melvin-mlp on 4 Sep 2017

michallohnisky on 5 Sep 2017

The desired feature can already be achieved with:

[program:x]
...
autorestart=true
startretries=24

This will already add an increasing delay for each retry.
I think this issue can be closed.

0x20h on 10 Sep 2017

👍2

diggerdu on 21 Sep 2017

If you want a fixed delay, make the program sleep before running.

[program:xxx]
command=sleep 30 && ...

anatoliykim on 4 Oct 2017

👎7 👍1

"command=sleep 30 && ..." doesn't work - it causes:

sleep: invalid time interval ‘&&’
sleep: invalid time interval ‘/usr/bin/php’
sleep: invalid time interval ‘--some --options’
Try 'sleep --help' for more information.

They made it easy in systemd:

Restart=always
RestartSec=3

And it restarts, and waits 3 secs between restarts.

tchwpkgorg on 20 Oct 2017

While this one works:

[program:www]
command=bash -c "/var/www/bin/server.sh; sleep 3"

It causes quite a nasty output in process list (ps auxf), i.e.:

www-data 13026 0.0 0.0 4512 852 ? S 00:53 0:00 _ sh -c sleep 10; /usr/bin/php worker.php
www-data 13031 0.0 0.0 11736 764 ? S 00:53 0:00 | _ sleep 10
www-data 13027 0.0 0.0 4512 708 ? S 00:53 0:00 _ sh -c sleep 10; /usr/bin/php worker.php
www-data 13033 0.0 0.0 11736 664 ? S 00:53 0:00 | _ sleep 10
www-data 13029 0.0 0.0 4512 756 ? S 00:53 0:00 _ sh -c sleep 10; /usr/bin/php worker.php
www-data 13036 0.0 0.0 11736 820 ? S 00:53 0:00 | _ sleep 10

Also, it basically doubles the number of processes run by a given user. So if you have any monitoring checks which look at a number of processes run by a given user, you may need to make some bigger or smaller adjustments. Also, command line output will be different.

tchwpkgorg on 20 Oct 2017

Several others have mentioned a "backoff" implementation and the BACKOFF state, but only 0x20h mentioned this vital clue (though, I wish he'd been more explicit):

The mechanism that implements startretries does employ a backoff strategy that increases the delay by 1 second with each attempt.

In other words, you can set startretries to a value that is large enough to cover any "expected downtime" in your workflow without causing a self-inflicted DoS scenario.

While it might be nice to have control over the backoff computation, I concur that this has been addressed in as much as a self-inflicted DoS will not result from setting startretries too generously.
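
To make "large enough" concrete, here is a small back-of-the-envelope sketch. It assumes the wait before retry k is k seconds, as observed in the logs in this thread (not taken from the docs), and it ignores the time each failed attempt itself runs, so the real window is somewhat longer. The function names are made up for illustration.

```python
def total_backoff_seconds(startretries):
    """Sum of the waits 1 + 2 + ... + startretries, in seconds."""
    return startretries * (startretries + 1) // 2


def min_startretries(expected_downtime_seconds):
    """Smallest startretries whose cumulative wait covers the downtime."""
    retries = 0
    while total_backoff_seconds(retries) < expected_downtime_seconds:
        retries += 1
    return retries


print(total_backoff_seconds(3))    # 6
print(total_backoff_seconds(24))   # 300
print(min_startretries(100))       # 14
```

Under that assumption, the default of 3 retries covers only about 6 seconds of waiting, while startretries=24 covers about 5 minutes.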

cbj4074 on 14 Nov 2017

👍7

Can anyone point to where in the docs what 0x20h describes (the "add an increasing delay for each retry") is documented? Thanks.

nvictor on 4 Dec 2017

@nvictor If it were documented, I doubt any of us would be here. ;) I discovered that the delay is increased by one second with each retry by examining supervisord's log entries.

cbj4074 on 5 Dec 2017

@nvictor: @vlsd has already pointed to this documentation:

http://supervisord.org/subprocess.html#process-states

Each start retry will take progressively more time.

I was also looking for a solution for my problem. I connect to an IMAP server and get a connection timeout. Starting the script after a little delay works great, but supervisor is too quick to restart (for this job).

Therefore a delay as a configuration option would be great to prevent hacks like the 'sleep' option mentioned above (which I will try now, as there is no configurable alternative).

rotorsolutions on 2 Jan 2018

So it seems that only linear backoff is implemented (exponential backoff is the other common kind), and only at one hard-coded rate (1s per retry, as per @cbj4074, but not actually documented). I no longer have a horse in this race. It seems to me that giving users the ability both to switch between linear and exponential backoff and to set the rate at which the backoff happens would be an ideal solution here. Failing that, documenting that the backoff is linear and set at a rate of 1 second per retry is also a solution. Simply mentioning that backoff happens, with no other details, is confusing.
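
To illustrate the difference between the two kinds of backoff, here is a minimal sketch of both delay schedules. The parameter names (step, base, factor, cap) are hypothetical and are not supervisor options; the linear schedule with step=1.0 matches the 1s-per-retry behaviour reported above.

```python
def linear_backoff(retries, step=1.0):
    """Delay before retry k is k * step seconds."""
    return [step * attempt for attempt in range(1, retries + 1)]


def exponential_backoff(retries, base=1.0, factor=2.0, cap=None):
    """Delay before retry k is base * factor**(k - 1), optionally capped."""
    delays = []
    for attempt in range(retries):
        delay = base * factor ** attempt
        if cap is not None:
            delay = min(delay, cap)
        delays.append(delay)
    return delays


print(linear_backoff(5))        # [1.0, 2.0, 3.0, 4.0, 5.0]
print(exponential_backoff(5))   # [1.0, 2.0, 4.0, 8.0, 16.0]
```

With the same number of retries, the exponential schedule covers a much longer outage, which is why a cap is usually paired with it.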

vlsd on 3 Jan 2018

👍4

I am hitting the FATAL state issue: supervisor itself is still running, but the workers have gone down and are not processing any jobs.
Does anybody have a solution for this?

BibhuGlobussoft on 1 Feb 2018

Thanks @jderusse

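import datetime
import hashlib
import os
import sys
import time

max_backoff = 5          # longest sleep after a failure, in seconds
timeout = 60 * 60 * 6    # forget the stored backoff after 6 hours

def log(msg):
    print("[proc_wrapper %s] %s" % (str(datetime.datetime.now()), msg))

# The wrapped command is passed as arguments. Its hash names the state file,
# which is created in the current working directory.
command = " ".join(sys.argv[1:])
backoff_file = hashlib.md5(command.encode("utf-8")).hexdigest()
log("Running '%s', backoff file name: %s" % (command, backoff_file))

status = os.system(command)
if status != 0:
    if not os.path.exists(backoff_file):
        seconds = -1  # first failure: the increment below makes this 0
    else:
        # State file format: "<last sleep in seconds>:<unix timestamp>"
        with open(backoff_file) as f:
            content = f.read().split(":")
            seconds, last_timestamp = int(content[0]), int(content[1])
        if time.time() - last_timestamp > timeout:
            seconds = -1  # stored state is stale, start over
    if seconds + 1 > max_backoff:
        seconds = max_backoff
    else:
        seconds = seconds + 1
        with open(backoff_file, "w") as f:
            f.write("%d:%d" % (seconds, int(time.time())))
    log("Command '%s' exited with status %d, sleep %ds" % (command, status, seconds))
    time.sleep(seconds)
    # Exit non-zero so supervisor sees the failure and restarts the wrapper.
    sys.exit(1)
else:
    # Success: clear the stored backoff.
    try:
        os.remove(backoff_file)
    except OSError:
        pass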

I wrote this script to implement the feature.

Then add

killasgroup = true
stopasgroup = true

to the [program:x] section.

virusdefender on 6 Jun 2018

yongzhang on 15 Jun 2018

estshy on 13 Sep 2018

bramstroker on 14 Dec 2018

Has this problem been solved? How?

zhaodanwjk on 22 Feb 2019

@zhaodanwjk current behaviour is the following: on first failure - restart after 1 second, on second failure - restart after 2 seconds, then 3 seconds, 4 seconds, etc. Until FATAL state.

Chupaka on 23 Feb 2019

👍1

@zhaodanwjk current behaviour is the following: on first failure - restart after 1 second, on second failure - restart after 2 seconds, then 3 seconds, 4 seconds, etc. Until FATAL state.

Thank you very much. We can only wait until the new function is developed or another method is adopted to solve it.

zhaodanwjk on 25 Feb 2019

@zhaodanwjk To solve what? What do you expect? Isn't command=bash -c "runme.sh; sleep 3" (probably with startsecs=0) enough for you?

Chupaka on 25 Feb 2019

❤1

@Chupaka Thank you very much, but I tried this method and could not solve my problem. I need to wait 100 seconds before restarting automatically when the service quits abnormally.

zhaodanwjk on 27 Feb 2019

@zhaodanwjk look at my previous comment https://github.com/Supervisor/supervisor/issues/487#issuecomment-292556138

jderusse on 27 Feb 2019

A better workaround is to use a script that restarts fatal processes by configuring an eventlistener that receives PROCESS_STATE_FATAL events.
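
A minimal sketch of such a listener follows, written against the event listener protocol (READY, a header line, a payload of len bytes, then a RESULT reply) using only the standard library. It assumes supervisorctl is on PATH and can reach the running supervisord, and that the payload is plain ASCII. The names run_listener, restart_with_supervisorctl and delay_seconds are made up for illustration.

```python
import subprocess
import time


def parse_tokens(text):
    """Parse 'key:value key:value' tokens into a dict."""
    return dict(token.split(":", 1) for token in text.split())


def restart_with_supervisorctl(group, name):
    # Assumption: supervisorctl is on PATH and can talk to supervisord.
    subprocess.call(["supervisorctl", "start", "%s:%s" % (group, name)])


def run_listener(stdin, stdout, restart=restart_with_supervisorctl,
                 delay_seconds=30.0, sleep=time.sleep):
    """Serve supervisor events until stdin is closed."""
    while True:
        # Tell supervisord we are ready for the next event.
        stdout.write("READY\n")
        stdout.flush()
        header_line = stdin.readline()
        if not header_line:
            break  # supervisord closed the pipe
        headers = parse_tokens(header_line)
        payload = stdin.read(int(headers["len"]))
        if headers.get("eventname") == "PROCESS_STATE_FATAL":
            data = parse_tokens(payload)
            # Sleeping here blocks this listener; acceptable for a sketch.
            sleep(delay_seconds)
            restart(data["groupname"], data["processname"])
        # Acknowledge the event so supervisord sends the next one.
        stdout.write("RESULT 2\nOK")
        stdout.flush()
```

The script's entry point would call run_listener(sys.stdin, sys.stdout), and it would be registered in an [eventlistener:...] section subscribed to PROCESS_STATE_FATAL. Anything the listener wants to log has to go to stderr, since stdout carries the protocol.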