supervisorctl status returns exit code 3 for STOPPED state

Created on 8 Apr 2019 · 15 comments · Source: Supervisor/supervisor

When I run the supervisorctl status command, it exits with code 3, but I can't see any error message.

Steps to reproduce

Create a config file supervisord.conf:

[supervisord]
nodaemon = true

[inet_http_server]
port = 127.0.0.1:9001

[supervisorctl]
serverurl = http://127.0.0.1:9001

[rpcinterface:supervisor]
supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface

[program:cat]
command = /bin/cat
autostart = false

Start supervisord in the foreground with that config:

$ supervisord -c supervisord.conf 

In another terminal, run supervisorctl status:

$ supervisorctl -c supervisord.conf status
cat                              STOPPED   Not started
$ echo $?
3

This was introduced in Supervisor 4.0.0 and is caused by these lines added in PR #668:

https://github.com/Supervisor/supervisor/blob/662ef84d2798a0e7f6fa18c86f44044c9768012f/supervisor/supervisorctl.py#L683-L685

supervisorctl status will set exit code 3 any time a subprocess is in states.STOPPED_STATES. These are: STOPPED, EXITED, FATAL, and UNKNOWN.
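To make the rule concrete, here is a simplified Python model of it. It is a sketch, not the code from PR #668: `status_exit_code` is a hypothetical name, and the state names are exactly the four listed above. STARTING, BACKOFF and STOPPING are not in that list, so under this rule they do not by themselves produce 3.

```python
# Simplified model of the rule described above (not supervisor's source).
# These are the state names quoted from states.STOPPED_STATES.
STOPPED_STATES = frozenset({"STOPPED", "EXITED", "FATAL", "UNKNOWN"})

# LSB exit code for the status action meaning "program is not running".
NOT_RUNNING = 3


def status_exit_code(statenames):
    """Hypothetical helper: the exit code the rule yields for these states."""
    if any(name in STOPPED_STATES for name in statenames):
        return NOT_RUNNING
    return 0


# The reproduce config: "cat" has autostart = false, so it stays STOPPED.
print(status_exit_code(["STOPPED"]))              # 3
print(status_exit_code(["RUNNING"]))              # 0
print(status_exit_code(["STOPPING", "BACKOFF"]))  # 0 under this rule
```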

cc: @lukeweber

:+1:

We have deployment scripts that run supervisorctl status > /dev/null && echo $? to check if supervisor is working properly - and found that 4.0.0 returns 3 because we have some old+busted apps that are perpetually in a stopped state.

For now we will downgrade to a 3.x.x version that works for us, but good to know we're not crazy!(?).

Some updated documentation on this would be great. How could I contribute?

Remember kids: pin your dependency version numbers to something stable!

EDIT: Actually, we find this behavior in 3.3.4 too(?) so our solution is to merely change our deploy script to execute supervisorctl version instead of status and check for exit code 0.

(This is how we validate that supervisor is running and healthy before we attempt a deploy)
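A minimal Python sketch of that pre-deploy check, assuming supervisorctl is on the PATH. The helper name `supervisor_is_reachable` and its parameters are hypothetical. The check is only meaningful on a version that sets a non-zero exit code when it cannot reach supervisord; as noted later in this thread, older versions returned 0 even then.

```python
import subprocess


def supervisor_is_reachable(ctl="supervisorctl", config=None, timeout=10):
    """Hypothetical pre-deploy check: True if `supervisorctl version` exits 0.

    `version` asks the running supervisord for its version, so a zero exit
    code suggests the daemon is up and answering.
    """
    cmd = [ctl]
    if config is not None:
        cmd += ["-c", config]  # same -c option as in the reproduce steps
    cmd.append("version")
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout)
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # supervisorctl is not installed, or the call hung.
        return False
    return result.returncode == 0
```

A deploy script would call `supervisor_is_reachable(config="supervisord.conf")` and abort when it returns False; unlike status, this does not depend on whether any program is stopped.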

EDIT: Actually, we find this behavior in 3.3.4 too(?) so our solution is to merely change our deploy script to execute supervisorctl version instead of status and check for exit code 0.

The PR with this change was not merged into the 3.x branch. The change being discussed here, that supervisorctl status sets the exit code to 3 if a process is in the STOPPED state, was introduced in 4.0.0.

I tested the last 3.x release (3.4.0) and the version you claim is affected (3.3.4) using the reproduce instructions above. For both versions, echo $? shows 0 as it always did in Supervisor 3.x.

Supervisor 4.0.0 is a new major release of Supervisor. The major version number was bumped from 3.x to 4.x to communicate breaking changes. Please see the changelog for a list of changes in 4.0.0. One of the changes in 4.0.0 is that supervisorctl now sets exit codes for more conditions.

In Supervisor 3.x, very few conditions caused the exit code to be non-zero. For example, you could supervisorctl start zope and if zope failed to start, the exit code would still be 0. There were several tickets and many commenters requesting that supervisorctl be changed to set meaningful exit codes when possible. So, supervisorctl was overhauled to do this, and those changes were released in 4.0.0. Scripts that expect the exit code of supervisorctl to almost always be 0 will need to be updated.

This ticket is specifically about supervisorctl status setting exit code 3 whenever a process is in the STOPPED state. If you believe this specific behavior is incorrect and/or causes you trouble, please leave a detailed comment why.

Apologies, some confusion over here. 3.4.0 definitely doesn't show the behavior.

$ sudo supervisorctl status
<novus service>                      STOPPED   Apr 09 11:50 PM
$ echo $?
0
$ sudo supervisorctl version
3.4.0

Our infra is just a bit messed up: even though we downgraded from 4.0.0 to a prior version, we still needed to force the supervisord process to restart for it to use the 3.4.0 version.

Thanks!

It has been two full months since this issue was reported. There have been no additional comments here, and no related issues reported, in that time.

Many users requested that supervisorctl set exit codes to make automating it easier. The change was made in a major release (4.0). The change will not be reverted. Users who need the old behavior where supervisorctl does not set exit codes will need to stay on a 3.x version.

@mnaberez I fail to see how this ...:

The major version number was bumped from 3.x to 4.x to communicate breaking changes. Please see the changelog for a list of changes in 4.0.0. One of the changes in 4.0.0 is that supervisorctl now sets exit codes for more conditions.
[...]
This ticket is specifically about supervisorctl status setting exit code 3 whenever a process is in the STOPPED state. If you believe this specific behavior is incorrect and/or causes you trouble, please leave a detailed comment why.

... is "shown" here:

- ``supervisorctl`` will now set its exit code to a non-zero value when an
  error condition occurs.  Previous versions did not set the exit code for
  most error conditions so it was almost always 0.  Patch by Luke Weber.

I understand that you have improved this, but I don't see why supervisorctl status would return 3 for services that I haven't autostarted or started at all since registering them. I am not sure what the correct fix would be.

  • When I check a service's status, I would like to be greeted with $?=0 whenever supervisorctl "feels" that everything is "business as usual" with "no internal errors".
  • Then there is another issue: how am I to distinguish a failure such as <class 'xmlrpclib.Fault'>, <Fault 6: 'SHUTDOWN_STATE'> (https://github.com/Supervisor/supervisor/issues/48) from (IMHO) a "valid" state (like the one described above)?

I fail to see how this ...:
... is "shown" here:

At this point, this issue has been open for discussion for two full months with no additional feedback. I understand you do not agree with the precise wording of the changelog, but the exit codes will not be reverted. Many users requested them. The current implementation was in the git master branch for almost a year before it was released, and no feedback then either.

When I check a service's status, I would like to be greeted with $?=0 whenever supervisorctl "feels" that everything is "business as usual" with "no internal errors".

It doesn't have a way to divine what "business as usual" means, so instead it uses the exit code to communicate the process status from a predefined list. If the processes are all running then you get 0; if a process is stopped then you get 3.

Then there is another issue: how am I to distinguish a failure such as <class 'xmlrpclib.Fault'>, <Fault 6: 'SHUTDOWN_STATE'> (#48) from (IMHO) a "valid" state

An exit code of 0 means all subprocesses are running. An exit code of something else means the exit codes need to be compared to see what the issue might be. If 3, then a process is stopped. If some other value, then something else is happening.

[...] and no feedback then either.

In my case, I was installing it with apk add supervisor. I recently found out that it pulled in python2, however, so I tried pip3 instead, which then "informed" me of all this. I understand that limited feedback causes this, and I apologise for that.

It doesn't have a way to divine what "business as usual" means.

If I run systemctl status on my Ubuntu machine, it won't give me a non-zero status because e.g. apache is stopped (especially if I stopped it or never started it).

I understand that supervisor's task might be different, though (managing a single [?] containerized service?), so I am definitely not saying to revert the change.

However, there are tools that depend on a "success" exit code (i.e. 0), like Ansible, and those need to be handled appropriately.

An exit code of 0 means all subprocesses are running. An exit code of something else means the exit codes need to be compared to see what the issue might be. If 3, then a process is stopped. If some other value, then something else is happening.

So does that "definitely" mean that supervisorctl status gives $?=3 when "everything is okay", except that some managed services are stopped in some way?

I could teach Ansible failed_when: supervisorctl_status.rc not in [0,3]; however, that makes it complicated to then check whether a service is in an "unexpected" vs. an expected stopped state.
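One way around that, sketched below in Python, is to ignore the exit code and classify the text output of supervisorctl status against a list of programs that are allowed to be stopped. This assumes the column layout shown in the reproduce steps (name, then state); that text is meant for humans, so this parsing may break across versions. `EXPECTED_STOPPED`, the sample output, and the function names are made up for illustration.

```python
# Hypothetical allowlist: programs that may legitimately sit stopped.
EXPECTED_STOPPED = {"apache"}

# Stopped states that are fine for allowlisted programs...
EXPECTED_STOP_STATES = {"STOPPED", "EXITED"}
# ...and states that are flagged for every program.
ALWAYS_BAD = {"FATAL", "UNKNOWN"}


def parse_status(text):
    """Yield (name, state) pairs from `supervisorctl status` text output.

    Assumes each line starts with the process name followed by the state,
    separated by whitespace. Lines with fewer than two fields are skipped.
    """
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            yield fields[0], fields[1]


def unexpected_stops(text, expected_stopped):
    """Return names of processes whose stopped state was not expected."""
    flagged = []
    for name, state in parse_status(text):
        if state in ALWAYS_BAD:
            flagged.append(name)
        elif state in EXPECTED_STOP_STATES and name not in expected_stopped:
            flagged.append(name)
    return sorted(flagged)


# Made-up sample output in the same layout as the reproduce steps.
SAMPLE = """\
sshd                             RUNNING   pid 42, uptime 0:01:00
apache                           STOPPED   Not started
worker                           FATAL     Exited too quickly (process log may have details)
"""

print(unexpected_stops(SAMPLE, EXPECTED_STOPPED))  # ['worker']
print(unexpected_stops(SAMPLE, set()))             # ['apache', 'worker']
```

In this setup the exit code is not consulted at all; the job would fail only when the returned list is non-empty.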

Reading the colors in this diagram:
http://supervisord.org/_images/subprocess-transitions.png

It seems that there is no distinction between expected and unexpected stopped states (I _obviously_ read the text too, but the handling appears to be as described).

Previous discussion about this is in these tickets, which may be useful for background: https://github.com/Supervisor/supervisor/issues/24, https://github.com/Supervisor/supervisor/pull/620

Actual change: https://github.com/Supervisor/supervisor/pull/668

I did this work about 4 years ago, so forgive me if I miss a detail; the issue has been around since 2011. You should be able to monitor a single process by supplying the name of the process you're interested in, or a list of them, which might help, i.e. supervisorctl status foo baz, not including your bad process name that never runs and that you don't want status on (why monitor it if it's never running, right?).
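Building such a scoped call from a script might look like the Python sketch below. `status_command` and `status_of` are hypothetical helpers; the assumption, following the comment above, is that the exit code then reflects only the processes you name.

```python
import subprocess


def status_command(names, config=None, ctl="supervisorctl"):
    """Hypothetical helper: build `supervisorctl [-c CONFIG] status NAME...`."""
    cmd = [ctl]
    if config is not None:
        cmd += ["-c", config]
    cmd.append("status")
    cmd.extend(names)  # only the processes you actually want to watch
    return cmd


def status_of(names, config=None):
    """Run the scoped status call and return (exit code, text output)."""
    result = subprocess.run(
        status_command(names, config), capture_output=True, text=True
    )
    return result.returncode, result.stdout


print(status_command(["foo", "baz"], config="supervisord.conf"))
# ['supervisorctl', '-c', 'supervisord.conf', 'status', 'foo', 'baz']
```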

Previously, even if you couldn't connect to the supervisord daemon, supervisorctl would return 0 with a hard-to-parse error string, which was definitely a problem for many (see the tickets and related tickets).

So we tried to make things a bit more useful by standardizing on:
http://refspecs.linuxbase.org/LSB_3.0.0/LSB-PDA/LSB-PDA/iniscrptact.html

Aggregated status was always a bit of a question. I think if supervisord wanted to model its status command on systemctl, then adding is-enabled, is-available, and is-failed would be good additions for the future. In the interim, if you want safer behavior that detects failure cases at the status level and you don't mind stopped processes, ignoring 3 may be an option. If you want the original behavior, you had better parse the text output and ignore the exit status, since it was almost always 0 anyway.
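The "ignore 3" option can be written as a small decision function. This Python sketch is hypothetical; `status_is_acceptable` and `tolerate_stopped` are made-up names. One caveat follows from the rule described earlier in the thread: FATAL is in the same set of states as STOPPED, so tolerating 3 also hides a process that failed to start.

```python
# LSB status code that supervisorctl status uses when a process is stopped.
NOT_RUNNING = 3


def status_is_acceptable(rc, tolerate_stopped=True):
    """Hypothetical helper: should this `supervisorctl status` exit code pass?

    0 always passes. 3 passes only when tolerate_stopped is True.
    Any other code signals some other problem and fails.
    """
    if rc == 0:
        return True
    return tolerate_stopped and rc == NOT_RUNNING


print(status_is_acceptable(0))                          # True
print(status_is_acceptable(3))                          # True
print(status_is_acceptable(3, tolerate_stopped=False))  # False
print(status_is_acceptable(1))                          # False
```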

It would be nice to ground a feature request in a specific need. I don't believe that in the majority of cases where people check for status=0 they're actually getting anything more than a false sense of security, and maybe verification that Python and supervisorctl are installed.

I am going to reopen this issue. This is not a reversal of any decision. My intention is to make it easier for this discussion to continue for a while since there is new activity.

@lukeweber

supervisorctl status foo baz, not including your bad process name that never runs and that you don't want status on (why monitor it if it's never running, right?).

In this case, I want supervisor's status (hence supervisorctl status). So far, the approach I am taking in my project is: autostart the necessary services (e.g. sshd), and start whatever else is needed / bundled in (e.g. apache) on demand.

I may be able to use the supervisor.d/*.conf scheme with conditional copies, but for a spike it was easier to "just have all the definitions there".

It would be nice to ground a feature request in a specific need.

Yes, this is why I am discussing it (explicitly not to ask for #668 to be reverted, but rather to "finely define it"). So far, I have only spiked its usage, so at the very most I can provide a "formal" use case (second part of https://github.com/Supervisor/supervisor/issues/1223#issuecomment-500738348).


I don't believe that in the majority of cases where people check for status=0 they're actually getting anything more than a false sense of security, and maybe verification that Python and supervisorctl are installed.

Always returning zero is a bad thing. The other problem is that only 0 is "the one true all good" exit code, and all the others indicate some kind of "not as expected" behavior.
However, overloading exit codes can be equally bad:

  1. If supervisor's objective is to be "systemctl, but smaller and simpler", then "overloading" the exit code to report stopped services would be wrong.
  2. If supervisor's objective is to provide service management for minimalistic Docker containers, where all defined services MUST be up, then returning non-zero is mandatory.

If [2] is true, it would be nice to have an explicit mention somewhere that 'supervisorctl status'.rc == 3 means that and only that, and to make sure not to conflate LSBInitErrorCode.UNIMPLEMENTED_FEATURE with LSBStatusErrorCode.NOT_RUNNING (since both of them are 3).
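The overlap comes from the LSB specification itself: exit code 3 means "unimplemented feature" for ordinary init-script actions but "program is not running" for the status action. The Python sketch below encodes both LSB tables and interprets a code according to the action that produced it. The enum and function names are hypothetical and are not supervisor's own classes; which of these codes supervisorctl actually emits for each command is not established in this thread.

```python
from enum import IntEnum


class LSBInitAction(IntEnum):
    """Exit codes for init-script actions other than status (LSB 3.0.0)."""
    SUCCESS = 0
    GENERIC_ERROR = 1
    INVALID_ARGS = 2
    UNIMPLEMENTED_FEATURE = 3
    INSUFFICIENT_PRIVILEGE = 4
    NOT_INSTALLED = 5
    NOT_CONFIGURED = 6
    NOT_RUNNING = 7


class LSBStatusAction(IntEnum):
    """Exit codes for the status action (LSB 3.0.0)."""
    OK = 0
    DEAD_PID_FILE_EXISTS = 1
    DEAD_LOCK_FILE_EXISTS = 2
    NOT_RUNNING = 3
    UNKNOWN = 4


def explain(action, rc):
    """Interpret rc using the table for the action that produced it."""
    table = LSBStatusAction if action == "status" else LSBInitAction
    try:
        return table(rc).name
    except ValueError:
        return f"unrecognized ({rc})"


print(explain("status", 3))  # NOT_RUNNING
print(explain("start", 3))   # UNIMPLEMENTED_FEATURE
```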

I just ran into this. We deploy stuff using Salt and salt uses supervisorctl status to get list of running/configured apps. Exit code 3 breaks this as salt considers any exit code other than 0 to be a failure.

I can understand why exit codes may be used as a way of communicating the status of processes (although I don't agree with the approach). However, I feel STOPPED and STOPPING should be excluded from list of _bad_ exit codes. If a process is FATAL or BACKOFF then it makes sense to send a 3 exit code but it seems unfair to "fail" a command (non-zero exit code) if some processes have been stopped on user request.
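A check along these lines can be done without supervisorctl's exit code by asking supervisord directly over XML-RPC, which the reproduce config above enables via [inet_http_server] and [rpcinterface:supervisor]. The Python sketch below uses the `supervisor.getAllProcessInfo` call and flags only FATAL and BACKOFF, as suggested in the comment; adding UNKNOWN, the helper names, and the URL are assumptions.

```python
import xmlrpc.client

# FATAL and BACKOFF are flagged as in the comment above; UNKNOWN is an
# added assumption. STOPPED, STOPPING, EXITED, STARTING and RUNNING pass.
BAD_STATES = {"FATAL", "BACKOFF", "UNKNOWN"}


def fetch_process_infos(url="http://127.0.0.1:9001/RPC2"):
    """Ask supervisord for every process's info dict over XML-RPC."""
    server = xmlrpc.client.ServerProxy(url)
    return server.supervisor.getAllProcessInfo()


def unhealthy(process_infos):
    """Return 'group:name' for each process in a bad state."""
    return sorted(
        f"{info['group']}:{info['name']}"
        for info in process_infos
        if info["statename"] in BAD_STATES
    )


# Usage against a live supervisord (not run here):
#   problems = unhealthy(fetch_process_infos())
#   a non-empty `problems` list would fail the deploy
```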
