Supervisor: autorestart does not work as advertised

Created on 30 Jan 2013 · 25 Comments · Source: Supervisor/supervisor

I'm running 3.0b1 and noticed the following error. With the program:

[program:foo]
command=ls
autorestart=False

The process repeatedly restarts:

2013-01-29 18:21:10,434 INFO RPC interface 'supervisor' initialized
2013-01-29 18:21:10,434 CRIT Server 'inet_http_server' running without any HTTP authentication checking
2013-01-29 18:21:10,435 INFO supervisord started with pid 18680
2013-01-29 18:21:11,437 INFO spawned: 'foo' with pid 18941
2013-01-29 18:21:11,447 INFO exited: foo (exit status 0; not expected)
2013-01-29 18:21:12,449 INFO spawned: 'foo' with pid 18943
2013-01-29 18:21:12,461 INFO exited: foo (exit status 0; not expected)
2013-01-29 18:21:14,464 INFO spawned: 'foo' with pid 18966
2013-01-29 18:21:14,473 INFO exited: foo (exit status 0; not expected)
2013-01-29 18:21:17,476 INFO spawned: 'foo' with pid 18969
2013-01-29 18:21:17,484 INFO exited: foo (exit status 0; not expected)
2013-01-29 18:21:18,485 INFO gave up: foo entered FATAL state, too many start retries too quickly

The unexpected option seems to fail too:

[program:foo]
command=ls
autorestart=unexpected
exitcodes=0

2013-01-29 18:22:48,026 INFO RPC interface 'supervisor' initialized
2013-01-29 18:22:48,027 CRIT Server 'inet_http_server' running without any HTTP authentication checking
2013-01-29 18:22:48,027 INFO supervisord started with pid 19955
2013-01-29 18:22:49,029 INFO spawned: 'foo' with pid 19960
2013-01-29 18:22:49,038 INFO exited: foo (exit status 0; not expected)
2013-01-29 18:22:50,041 INFO spawned: 'foo' with pid 19961
2013-01-29 18:22:50,048 INFO exited: foo (exit status 0; not expected)
2013-01-29 18:22:52,050 INFO spawned: 'foo' with pid 19962
2013-01-29 18:22:52,058 INFO exited: foo (exit status 0; not expected)
2013-01-29 18:22:55,063 INFO spawned: 'foo' with pid 19972
2013-01-29 18:22:55,072 INFO exited: foo (exit status 0; not expected)
2013-01-29 18:22:56,073 INFO gave up: foo entered FATAL state, too many start retries too quickly

srwilson

All 25 comments

I believe it's because of the startretries setting. After reaching X number of retries, the state will become FATAL.
http://supervisord.org/configuration.html#program-x-section-settings

dexterbt1 on 30 Jan 2013
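
The timestamps in the original report fit this explanation: the four spawns are 1, 2, and 3 seconds apart, and after the last failure the process goes to FATAL. A minimal sketch of that schedule, assuming each retry waits one second longer than the previous one and that startretries is at its default of 3 (`spawn_offsets` is a hypothetical helper, not part of Supervisor):

```python
def spawn_offsets(startretries=3):
    """Return approximate time offsets (in seconds) of each spawn attempt.

    Simplified model inferred from the log timestamps in this thread:
    the first spawn is at offset 0 and each retry waits one second longer
    than the previous one. The run time of the process itself is ignored.
    """
    offsets = [0]
    for retry in range(1, startretries + 1):
        offsets.append(offsets[-1] + retry)
    return offsets


if __name__ == "__main__":
    # Default of 3 retries: 4 spawn attempts in total, then FATAL.
    print(spawn_offsets())   # [0, 1, 3, 6]
    # startretries=0: a single spawn attempt, no retries.
    print(spawn_offsets(0))  # [0]
```

The offsets 0, 1, 3, 6 line up with the spawns logged at 18:21:11, :12, :14, and :17 in the first report.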

Yes, setting startretries to 0 will cause the process to never restart. This is the workaround I am using. But autorestart and exitcodes should work as one would expect; otherwise, what is the point of having them?

srwilson on 30 Jan 2013

Same bug here, exitcodes is completely ignored (default to 0).

[program:myworker]
command=...
autorestart=true

...
supervisord.log: exited: myworker (exit status 0; not expected)
supervisord.log: gave up: myworker entered FATAL state, too many start retries too quickly

Exit status 0 shouldn't be reported as "not expected", since it is expected by default (http://supervisord.org/configuration.html#program-x-section-settings).

mnapoli on 15 Jan 2014

👍1

Just curious, but does this happen with processes that daemonize and start in less than a second (the default value of startsecs is 1)?

smschauhan on 21 Jan 2014

@smschauhan It happens for me when my worker starts and exits in less than 1s.

What the worker does is: start, wait for a task, process it, exit. If there are several tasks queued, then the worker might restart 3 times, each time taking less than 1 second. So supervisor will mark it as "failed" and will not restart it again.

mnapoli on 21 Jan 2014

(My current workaround is to have wait(1) in my worker so that it waits 1 second, which is not really good.)

mnapoli on 21 Jan 2014

👍1

analytically on 29 Apr 2014

Having the same issue:

[program:configure]
command=python /root/configure.py
autorestart=unexpected
priority=0
exitcodes=0

When I run it, I get this:
INFO exited: configure (exit status 0; not expected)
and it attempts to restart the process.

So it seems that exitcodes are completely ignored.

aventurella on 30 Apr 2014

kevin-buttercoin on 29 May 2014

Need to ensure startsecs = 0:

[program:foo]
command = ls
startsecs = 0
autorestart = false

http://supervisord.org/configuration.html

startsecs

The total number of seconds which the program needs to stay running after a startup to consider the start successful. If the program does not stay up for this many seconds after it has started, even if it exits with an “expected” exit code (see exitcodes), the startup will be considered a failure. Set to 0 to indicate that the program needn’t stay running for any particular amount of time.

jdeathe on 3 Jul 2014

👍17 😄2 ❤1
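
To make the interaction concrete, here is a rough Python model of the rule in the documentation excerpt above. It is only a sketch of the documented behavior, not Supervisor's actual implementation; the function name `classify_exit` and the action strings are made up for illustration.

```python
RETRY = "retry start (counts against startretries)"
RESTART = "restart"
LEAVE = "leave exited"


def classify_exit(uptime, exit_code, startsecs, exitcodes, autorestart):
    """Return (log_suffix, action) for a process that has just exited.

    uptime      -- seconds the process stayed up after being spawned
    startsecs   -- value of the startsecs setting
    exitcodes   -- collection of "expected" exit codes
    autorestart -- "true", "false", or "unexpected"
    """
    if uptime < startsecs:
        # Exited before the start was considered successful: treated as a
        # failed start, whatever the exit code or autorestart setting is.
        return ("not expected", RETRY)

    expected = exit_code in exitcodes
    suffix = "expected" if expected else "not expected"
    if autorestart == "true":
        return (suffix, RESTART)
    if autorestart == "unexpected" and not expected:
        return (suffix, RESTART)
    return (suffix, LEAVE)


if __name__ == "__main__":
    # `ls` finishes almost instantly; with startsecs=1 it is a failed start.
    print(classify_exit(0.01, 0, startsecs=1, exitcodes=(0,), autorestart="false"))
    # With startsecs=0 the exit code and autorestart are actually consulted.
    print(classify_exit(0.01, 0, startsecs=0, exitcodes=(0,), autorestart="false"))
```

Under this model the "exit status 0; not expected" message in the reports above comes from the first branch, which never looks at exitcodes.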

I also got hit by this. The output is confusing! It should say that it exited too soon, not that 0 was unexpected!

timlesallen on 13 Jan 2015

I had to set startsecs=0 in Docker to get anything to start successfully with supervisor. Otherwise it would seemingly retry very quickly and endlessly, even though the original processes were actually running. I'm running 3.0b2-1 on Ubuntu. I can't find any evidence that setting autorestart=false did anything at all.

vinceskahan on 21 Jan 2015

Shellbye on 28 Jan 2015

florinbroasca on 7 Feb 2015

ckrybus on 2 Mar 2015

+1 Same issue here in Docker.
The exitcodes list is completely ignored when using autorestart=unexpected.

EthanSbbn on 24 Mar 2015

liverbool on 5 Apr 2015

_EDIT: even though the exit code is 0, the program is daemonizing itself_
this is not how supervisor expects a program to behave, it should run in the foreground
see http://supervisord.org/subprocess.html#nondaemonizing-of-subprocesses
END EDIT

+1 on docker
below is an example for rsyslog

2015-06-30T09:45:28.622083575Z 2015-06-30 09:45:28,621 INFO exited: rsyslogd (exit status 0; not expected)

2015-06-30T09:45:29.623299820Z 2015-06-30 09:45:29,623 INFO gave up: rsyslogd entered FATAL state, too many start retries too quickly

with supervisor config:
[supervisord]
nodaemon=true

[program:rsyslogd]
command=/usr/sbin/rsyslogd
autostart=true
autorestart=false
startretries=0

and in the docker instance the running processes:
root@5d8962193676:/var/log# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.1 47296 12676 ? Ss+ 09:45 0:00 /usr/bin/python /usr/bin/super
root 5 0.0 0.0 20228 1988 ? Ss 09:45 0:00 /bin/bash
root 11 0.0 0.0 91164 4460 ? S 09:45 0:00 nginx: master process /usr/sbi
root 14 0.0 0.0 184924 1196 ? Ssl 09:45 0:00 /usr/sbin/rsyslogd

rolfvreijdenberger on 30 Jun 2015

Clearly a bug. I've just experienced the same where

web_1      | 2015-07-22 23:25:39,985 INFO exited: tmpcreator (exit status 0; not expected)

despite 0 being the default exit code, and it works exactly the same if I explicitly set exitcodes=0.

My workaround was also setting startsecs=0.

ain on 23 Jul 2015

👍7 🎉1 😄1

> Clearly a bug.

You're not giving the Supervisor developers much to work with here (you're definitively saying you have found a bug but have only provided one line of log output with little configuration details and no explanation about what happened or didn't happen).

> web_1 | 2015-07-22 23:25:39,985 INFO exited: tmpcreator (exit status 0; not expected)
> despite 0 being the default exit code and works exactly the same if I exclusively set exitcodes=0

I'll have to guess your issue is that you set exitcodes=0 but you were surprised by the log message saying the exit status was 0 and that the exit was not expected.

https://github.com/Supervisor/supervisor/blame/3.1.3/docs/configuration.rst#L664-L671

``startsecs``

The total number of seconds which the program needs to stay running
after a startup to consider the start successful.  If the program
does not stay up for this many seconds after it has started, even if
it exits with an "expected" exit code (see ``exitcodes``), the
startup will be considered a failure.  Set to ``0`` to indicate that
the program needn't stay running for any particular amount of time.

Note: _even if it exits with an "expected" exit code (see exitcodes), the startup will be considered a failure._ It sounds like your process exited with status 0 but didn't stay up for startsecs so it was considered a failure as described. The log probably also has the message "Exited too quickly (process log may have details)".

> My workaround was also setting startsecs=0.

This is also suggested in the documentation quoted above.

mnaberez on 23 Jul 2015
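
For anyone wanting to check which programs in a config are exposed to this, the following is a small sketch that reports the effective startsecs per program. It uses Python's standard configparser rather than Supervisor's own parser, assumes the documented default of 1 when startsecs is absent, and `effective_startsecs` is a hypothetical helper name.

```python
import configparser

DEFAULT_STARTSECS = 1  # assumed default when the option is not set


def effective_startsecs(config_text):
    """Map each [program:x] section name to its effective startsecs."""
    # RawConfigParser leaves values such as %(program_name)s untouched.
    parser = configparser.RawConfigParser(inline_comment_prefixes=(";",))
    parser.read_string(config_text)
    result = {}
    for section in parser.sections():
        if section.startswith("program:"):
            name = section.split(":", 1)[1]
            result[name] = parser.getint(
                section, "startsecs", fallback=DEFAULT_STARTSECS
            )
    return result


if __name__ == "__main__":
    sample = """
[program:foo]
command=ls
autorestart=unexpected
exitcodes=0

[program:bar]
command = ls
startsecs = 0
autorestart = false
"""
    # foo falls back to the default; bar has startsecs set explicitly.
    print(effective_startsecs(sample))  # {'foo': 1, 'bar': 0}
```

A short-lived command such as `ls` paired with a nonzero value here is the combination that produces the "not expected" messages in this thread.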

I'm still getting this error with the newest supervisor update.

My only workaround is to set startsecs=0 for my elasticsearch program.

Is it correct/proper to set this value? Or would there be another solution?

My concern is that /var/log/supervisor/supervisor.log is being filled with excessive elasticsearch spawn statements:

2017-01-24 15:02:05,780 INFO success: elastic_search_1 entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2017-01-24 15:02:05,797 INFO exited: elastic_search_1 (exit status 0; expected)
2017-01-24 15:02:06,800 INFO spawned: 'elastic_search_1' with pid 10002
2017-01-24 15:02:06,805 INFO success: elastic_search_1 entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2017-01-24 15:02:06,819 INFO stopped: elastic_search_1 (terminated by SIGQUIT)

[program:elastic_search_1]
command=/usr/sbin/service elasticsearch start
autostart=true
autorestart=true
stopsignal=QUIT
exitcodes=0
numprocs=1
stdout_logfile=/var/log/supervisor/%(program_name)s-stdout.log
stderr_logfile=/var/log/supervisor/%(program_name)s-stderr.log
startsecs=0

Would love more input on this. Thanks!

rlam3 on 24 Jan 2017

@rlam3 - startsecs = 0 is correct.

If it restarts repeatedly you might want to look at the command you're using and ensure it's running the process in the foreground.

jdeathe on 24 Jan 2017

👍3
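
One way to check the foreground point before writing the config is to launch the command yourself and see whether it is still running after startsecs. The sketch below does that with the standard subprocess module; `stays_in_foreground` is a hypothetical helper, and it only observes the launched process itself, not any children that process may leave behind after daemonizing.

```python
import subprocess
import sys


def stays_in_foreground(command, startsecs=1.0):
    """Return True if `command` is still running after `startsecs` seconds."""
    proc = subprocess.Popen(command)
    try:
        proc.wait(timeout=startsecs)
    except subprocess.TimeoutExpired:
        # Still running: this is what a supervised program should do.
        proc.terminate()
        proc.wait()
        return True
    # The launched process returned early, for example because it is a
    # one-shot command or a launcher that backgrounds the real service.
    return False


if __name__ == "__main__":
    one_shot = [sys.executable, "-c", "pass"]
    long_running = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(stays_in_foreground(one_shot))
    print(stays_in_foreground(long_running))
```

A wrapper that starts a service and returns immediately would be expected to behave like the one-shot case here, which fits the restart loop in the log above.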

@jdeathe Should we not be using the service command to start elasticsearch? Would you recommend some other way of starting it and having supervisor monitor elasticsearch?

Would love to see a smarter/best-practice way of monitoring elasticsearch. Thanks!

Really appreciate your help here.

rlam3 on 24 Jan 2017

Found a better way of running elasticsearch that doesn't need startsecs=0 and won't bloat supervisor.log.

Reference
https://github.com/thomasvan/ubuntu16-magentoee2-nginx-php7-elasticsearch-supervisord-ssh/blob/master/supervisord.conf

[program:elasticsearch_node_1]
command=/usr/share/elasticsearch/bin/elasticsearch -p /var/run/elasticsearch/elasticsearch.pid -Des.default.path.home=/usr/share/elasticsearch -Des.default.path.logs=/var/log/elasticsearch -Des.default.path.data=/var/lib/elasticsearch -Des.default.path.work=/tmp/elasticsearch -Des.default.path.conf=/etc/elasticsearch
user=elasticsearch
autostart=true
autorestart=true
redirect_stderr=true
numprocs=1
stdout_logfile=/var/log/supervisor/%(program_name)s-stdout.log
stderr_logfile=/var/log/supervisor/%(program_name)s-stderr.log

rlam3 on 24 Jan 2017

+1
2018-12-23 16:20:41,100 CRIT Supervisor running as root (no user in config file)
2018-12-23 16:20:41,100 INFO Included extra file "/etc/supervisor/conf.d/tomcat.conf" during parsing
2018-12-23 16:20:41,105 INFO RPC interface 'supervisor' initialized
2018-12-23 16:20:41,105 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2018-12-23 16:20:41,105 INFO supervisord started with pid 20564
2018-12-23 16:20:42,108 INFO spawned: 'tomcat' with pid 20602
2018-12-23 16:20:42,145 INFO exited: tomcat (exit status 0; not expected)
2018-12-23 16:20:43,147 INFO spawned: 'tomcat' with pid 20679
2018-12-23 16:20:43,159 INFO exited: tomcat (exit status 0; not expected)
2018-12-23 16:20:45,162 INFO spawned: 'tomcat' with pid 20815
2018-12-23 16:20:45,192 INFO exited: tomcat (exit status 0; not expected)
2018-12-23 16:20:48,196 INFO spawned: 'tomcat' with pid 21005
2018-12-23 16:20:48,222 INFO exited: tomcat (exit status 0; not expected)
2018-12-23 16:20:48,223 INFO gave up: tomcat entered FATAL state, too many start retries too quickly