Use case:
During a software update the server can't restart because some required Python modules are not available yet. Supervisor retries N times (N comes from the config; AFAIK it defaults to 3).
After N failures the process enters the FATAL state.
In my use case the program entered the FATAL state although the restart would have succeeded just a few seconds later.
Can you understand this use case?
A possible solution would be a delay, specified in seconds: after N failed retries, wait M seconds and then try again. In my case this should repeat indefinitely (endless loop).
:+1:
A delaybetweenretries parameter would be useful (bonus points for exponential backoff)
:+1: this would be really useful.
Allowing us to configure the time between retries would be great. Allowing for a backoff would also help.
I think it would be great if we could also specify a retry count of infinity (never give up) in combination with a sane value for delay_between_retries.
+1, would be useful.
:+1: would be great. There's already a PR with that feature https://github.com/Supervisor/supervisor/pull/509, but it's not automergeable, and there's no discussion on it.
+1
+1
+1
+1 This would be very useful for me
+1
+1
+1
Does anyone have enough knowledge to create a patch for this?
I would love to see this too, so :+1:
+1
:+1:
Would love to have this feature. :+1:
👍
Why was this ticket closed? Was the feature request implemented? I can't see a code change on this issue or on #561.
+1
dear sir plz implement this alrdy k thx bai
:thumbsup: would like this feature, thanks!
+1, would be useful for me too!
+1, hope this!
+1
+1
+1 for delay_between_retries and also exponential backoff would be great.
+1
+1
Let's try with PR #659; it's based on fclairamb's work and includes fixes for the tests.
+1
It is quite useful! Can someone help with that?
+1
+1
+1
+1
+1
+1
+1
delay_between_retries would be great
+1
The documentation says:
When an autorestarting process is in the BACKOFF state, it will be automatically restarted by supervisord. It will switch between STARTING and BACKOFF states until it becomes evident that it cannot be started because the number of startretries has exceeded the maximum, at which point it will transition to the FATAL state. Each start retry will take progressively more time.
Does that mean that this behavior is implemented but not made available? I am confused...
+1
Having this same issue using supervisor to run gogs on kubernetes, I need gogs to wait for mysql to init before crapping out
2016/08/18 21:20:33 [install.go:65 GlobalInit()] [E] Fail to initialize ORM engine: migrate: sync: dial tcp 127.0.0.1:3306: connection refused
2016-08-18 21:20:33,474 INFO exited: gogs (exit status 1; not expected)
2016-08-18 21:20:34,475 INFO gave up: gogs entered FATAL state, too many start retries too quickly
+1
+1
+1
Hacky implementation of this: sleep for x number of seconds after running your command. e.g.
[program:www]
command=bash -c "/var/www/bin/server.sh; sleep 3"
+1
+1
+1
+1
+1
+1
👍
The "sleep" alternative proposed by @cabloo won't work because supervisord does not support multiple commands or quotes - not sure why:
command=bash -c "java -version ; sleep 5"
will cause
can't parse command 'bash -c "java -version': No closing quotation
I'm running supervisor 3.3.1
@csgyuricza looks like it's parsing the ; wrong (cutting off at that point). I would try removing the space before it.
@cabloo I tried without spaces as well; doesn't work. It's really annoying that supervisor doesn't allow you to concatenate 2 or more commands together
@csgyuricza maybe try || instead of ; (which will only sleep if the command fails)
@csgyuricza make sure you edit directory=/bin/ as well, and then make the path in -c "/absolute/path/to/script.sh; sleep 5s" - absolute ;-)
Fired up just fine with me!
I just use sh -c "/mycommand || (sleep 5s && false)" to:
still waiting on this! +1
+1
+1
+3
+1
+1
The BACKOFF state does apparently increase the delay between restart attempts. The docs don't currently explain this. c.f. https://github.com/Supervisor/supervisor/pull/659
Perhaps I'll find the time to submit a pull request for the docs.
+1
+1
+1
This is really needed for a process to auto-recover from a network issue or after an external system has recovered.
+1
+1
+1
+1
The desired feature can already be achieved with:
[program:x]
...
autorestart=true
startretries=24
This will already add an increasing delay for each retry.
I think this issue can be closed.
+1
If you want fixed delay - make the program sleep before running.
[program:xxx]
command=sleep 30 && ...
"command=sleep 30 && ..." doesn't work - it causes:
sleep: invalid time interval ‘&&’
sleep: invalid time interval ‘/usr/bin/php’
sleep: invalid time interval ‘--some --options’
Try 'sleep --help' for more information.
They made it easy in systemd:
Restart=always
RestartSec=3
And it restarts, and waits 3 secs between restarts.
While this one works:
[program:www]
command=bash -c "/var/www/bin/server.sh; sleep 3"
It causes quite a nasty output in process list (ps auxf), i.e.:
www-data 13026 0.0 0.0 4512 852 ? S 00:53 0:00 _ sh -c sleep 10; /usr/bin/php worker.php
www-data 13031 0.0 0.0 11736 764 ? S 00:53 0:00 | _ sleep 10
www-data 13027 0.0 0.0 4512 708 ? S 00:53 0:00 _ sh -c sleep 10; /usr/bin/php worker.php
www-data 13033 0.0 0.0 11736 664 ? S 00:53 0:00 | _ sleep 10
www-data 13029 0.0 0.0 4512 756 ? S 00:53 0:00 _ sh -c sleep 10; /usr/bin/php worker.php
www-data 13036 0.0 0.0 11736 820 ? S 00:53 0:00 | _ sleep 10
Also, it basically doubles the number of processes run by a given user. So if you have any monitoring checks which look at a number of processes run by a given user, you may need to make some bigger or smaller adjustments. Also, command line output will be different.
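One way to keep the fixed delay without the extra shell and sleep entries is to sleep inside a small wrapper and then replace the wrapper with the real command via exec (the shell's `exec` builtin can do the same thing). The sketch below is only an illustration: the script name `delay_exec.py` and the function name are hypothetical, and it delays every start, including the first one.

```python
import os
import sys
import time


def delay_then_exec(argv, delay_seconds, sleep=time.sleep, execvp=os.execvp):
    """Wait delay_seconds, then replace this process with the command in argv.

    sleep and execvp are parameters only so the function can be exercised
    without really sleeping or replacing the current process.
    """
    if not argv:
        raise ValueError("no command given")
    sleep(delay_seconds)
    # os.execvp does not return on success: the wrapper process becomes the
    # command itself, so the process list shows a single entry per worker.
    execvp(argv[0], argv)


# Hypothetical usage: delay_exec.py SECONDS COMMAND [ARGS...]
if __name__ == "__main__" and len(sys.argv) > 2:
    delay_then_exec(sys.argv[2:], float(sys.argv[1]))
```

With this, the program section would use something like command=python /path/to/delay_exec.py 10 /usr/bin/php worker.php (path hypothetical). During the sleep the wrapper is still listed on its own, but no second process is kept around once the command starts.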
Several others have mentioned a "backoff" implementation and the BACKOFF state, but only 0x20h mentioned this vital clue (though, I wish he'd been more explicit):
The mechanism that implements startretries does employ a backoff strategy that increases the delay by 1 second with each attempt.
In other words, you can set startretries to a value that is large enough to cover any "expected downtime" in your workflow without causing a self-inflicted DoS scenario.
While it might be nice to have control over the backoff computation, I concur that this has been addressed in as much as a self-inflicted DoS will not result from setting startretries too generously.
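To pick a startretries value for a given expected downtime, the arithmetic can be sketched as below. It assumes the behaviour reported in this thread (delays of 1, 2, 3, ... seconds between attempts) and ignores the time each attempt runs before failing, so treat the result as a rough lower bound. The function names are made up for illustration.

```python
def total_backoff_seconds(startretries):
    """Sum of the delays 1 + 2 + ... + startretries, in seconds."""
    if startretries < 0:
        raise ValueError("startretries must be non-negative")
    return startretries * (startretries + 1) // 2


def retries_to_cover(downtime_seconds):
    """Smallest startretries whose summed delays reach downtime_seconds."""
    retries = 0
    while total_backoff_seconds(retries) < downtime_seconds:
        retries += 1
    return retries


print(total_backoff_seconds(24))  # 300
print(retries_to_cover(100))      # 14
```

Under these assumptions, the startretries=24 suggested earlier covers roughly five minutes of downtime.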
Can anyone point to where in the docs what 0x20h describes (the "add an increasing delay for each retry") is described? Thanks.
@nvictor If it were documented, I doubt any of us would be here. ;) I discovered that the delay is increased by one second with each retry by examining supervisord's log entries.
@nvictor @vlsd Has already pointed to this documentation:
http://supervisord.org/subprocess.html#process-states
Each start retry will take progressively more time.
I was also looking for a solution for my problem. I connect to an IMAP server and get a connection timeout. Starting the script after a little delay works great, but supervisor is too quick to restart (for this job).
Therefore a delay as a configuration option would be great to prevent hacks like the 'sleep' option mentioned above (which I will try now, as there is no configurable alternative).
So it seems that only linear backoff is implemented (exponential backoff is the other common kind), and only at one hard-coded rate (1s per retry, as per @cbj4074, but not actually documented). I no longer have a horse in this race. It seems to me that giving users the ability both to switch between linear and exponential backoff and to set the rate at which the backoff happens would be an ideal solution here. Failing that, documenting that the backoff is linear and set at a rate of 1 second per retry is also a solution. Simply mentioning that backoff happens, with no other details, is confusing.
I am hitting the FATAL state issue: supervisor itself is still running, but the workers have gone down and are not processing any jobs.
Does anybody have a solution for this?
Thanks @jderusse
import hashlib
import os
import time
import sys
import datetime

max_backoff = 5
timeout = 60 * 60 * 6

def log(msg):
    print("[proc_wrapper %s] %s" % (str(datetime.datetime.now()), msg))

command = " ".join(sys.argv[1:])
backoff_file = hashlib.md5(command.encode("utf-8")).hexdigest()
log("Running '%s', backoff file name: %s" % (command, backoff_file))
status = os.system(command)
if status != 0:
    if not os.path.exists(backoff_file):
        seconds = -1
    else:
        with open(backoff_file) as f:
            content = f.read().split(":")
        seconds, last_timestamp = int(content[0]), int(content[1])
        if time.time() - last_timestamp > timeout:
            seconds = -1
    if seconds + 1 > max_backoff:
        seconds = max_backoff
    else:
        seconds = seconds + 1
    with open(backoff_file, "w") as f:
        f.write("%d:%d" % (seconds, int(time.time())))
    log("Command '%s' exited with status %d, sleep %ds" % (command, status, seconds))
    time.sleep(seconds)
    exit(1)
else:
    try:
        os.remove(backoff_file)
    except Exception:
        pass
I wrote a new script to implement this function,
then added
killasgroup = true
stopasgroup = true
to the [program:x] section.
+1
+1
+1
Has this problem been solved? How?
@zhaodanwjk current behaviour is the following: on first failure - restart after 1 second, on second failure - restart after 2 seconds, then 3 seconds, 4 seconds, etc. Until FATAL state.
Thank you very much. We can only wait until the new function is developed or another method is adopted to solve it.
@zhaodanwjk To solve what? What do you expect? Isn't command=bash -c "runme.sh; sleep 3" (probably with startsecs=0) enough for you?
@Chupaka Thank you very much, but I tried this method and could not solve my problem. I need to wait 100 seconds before restarting automatically when the service quits abnormally.
@zhaodanwjk look at my previous comment https://github.com/Supervisor/supervisor/issues/487#issuecomment-292556138
A better workaround is to use a script that restarts fatal processes by configuring an eventlistener that receives PROCESS_STATE_FATAL events.
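A minimal sketch of such a listener follows, written against Supervisor's documented event listener protocol (READY / header line / payload / RESULT) using only the standard library. Several things here are assumptions rather than anything stated in this thread: that supervisord's XML-RPC interface is reachable over HTTP at `RPC_URL` (an [inet_http_server] on port 9001), that a 100 second delay is wanted, and that the listener is registered in an [eventlistener:...] section with events=PROCESS_STATE_FATAL. Sleeping inside the listener also blocks handling of further events in the meantime, which may or may not be acceptable.

```python
import os
import sys
import time
import xmlrpc.client

RPC_URL = "http://localhost:9001/RPC2"  # assumption: inet_http_server on port 9001
RESTART_DELAY = 100                     # assumption: seconds to wait before restarting


def parse_tokens(line):
    """Parse 'key:value key:value ...' (used by both header and payload)."""
    return dict(token.split(":", 1) for token in line.split())


def read_event(stdin):
    """Read one event: a header line, then a payload of header['len'] characters."""
    header = parse_tokens(stdin.readline())
    payload = parse_tokens(stdin.read(int(header["len"])))
    return header, payload


def main(stdin=sys.stdin, stdout=sys.stdout):
    server = xmlrpc.client.ServerProxy(RPC_URL)
    while True:
        stdout.write("READY\n")  # tell supervisord we can accept an event
        stdout.flush()
        header, payload = read_event(stdin)
        if header.get("eventname") == "PROCESS_STATE_FATAL":
            time.sleep(RESTART_DELAY)
            name = "%s:%s" % (payload["groupname"], payload["processname"])
            try:
                server.supervisor.startProcess(name)
            except (xmlrpc.client.Error, OSError) as exc:
                sys.stderr.write("restart of %s failed: %s\n" % (name, exc))
        stdout.write("RESULT 2\nOK")  # acknowledge the event
        stdout.flush()


# Only enter the loop when actually launched by supervisord.
if __name__ == "__main__" and os.environ.get("SUPERVISOR_ENABLED"):
    main()
```

The restart goes back through the normal start sequence, so a process that keeps failing will reach FATAL again and be retried after another delay.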