salt exit codes

Created on 26 Nov 2014  路  60Comments  路  Source: saltstack/salt

is there a reason why test.* returns False and exits with bash exit code 0 instead of 1 ?

~ # salt some-minion file.access  /var/run/reboot-required f; echo $?
<>:
    False
0

~ #  salt some-other-minion  file.access  /var/run/reboot-required f; echo $?
<>:
    True
0

Bug Core P1 severity-medium

Most helpful comment

+1
This issue has caused great pain in our normal deployment process. Actually it has caused a few production live site issues already. We have to apply other ad-hoc detection of the deployment result which is really unnecessary. For state.highstate at least we have succeeded and failed summary, but for others such as state.apply or cmd.run we have to review all the output and check if the command really succeeds or not.

I can't believe this is addressed as low priority. It is opened 2 years ago and still open.

All 60 comments

Thanks for bringing this up, I believe the behavior you are seeing is the expected behavior. The command salt <minion> file.access is executed successfully in both cases, so the return value in both cases is 0, regardless of the result of the file.access function itself.

hey, sorry. i would expect a different behavior, like if my test goes wrong, the return code should be 1 (http://tldp.org/LDP/abs/html/exitcodes.html) regardless if my salt command runs successfully.

salt <minion> file.file_exists /some/file/that/doesn't/exists ; echo $?
   False
0

salt <minion> file.file_exists /some/file/that/exists ; echo $?
   True
0

My reason for thinking a return of 0 was appropriate in this circumstance is that I am thinking of a return code as local to a system. Since salt runs successfully as intended on the master, it's return code would not be affected by the state of the remote minion. I admit that my reasoning on this is not very solid and am inclined to rather support your arguments.

A further consultation of the applicable standards as you have suggested is needed. I believe there is an issue somewhere about generally refactoring return statuses/return codes. I can't find it at the moment, but that ought to be referenced here.

Here it is: #7013.

@jfindlay , thanks for the info on #7013. This bug should be classified above "Low Severity", IMO.

@tjyang, you're right. It seems that not all the return code issues were not resolved with #11337.

This is the plan for unifying and standardizing salt exit codes.

  • salt beryllium:

    • Introduce a standard exit code enum-like data structure, probably somewhere in exitcodes.py such that integers and enums could be used interchangeably, but with enums being preferred within the source code.

    • Add a configuration option to enable the use of this new exit code standard by automatically evaluating the truthiness of the return value of the execution module functions and translating it into exit codes.

  • salt boron: Ensure that all salt execution modules explicitly comply with
    the new salt exit code standard.

@thatch45, @cachedout, @basepi, or @UtahDave may have further commentary on this issue.

+1

+1

+1

I've just raised/tested few other similar issues particularly important for me. Searching for "zero exit" gives a list of issues related to lack of error indication for CLI.

Lack of usability

It effectively makes automation (the very similar reason why Salt is even used in the first place) around Salt commands very inconvenient. And it definitely seems pervasive for Salt (many CLI commands, many functions called through CLI, etc.).

For example, the following are some use cases which require developing custom scripts to analyze Salt output just get Failed/Succeeded result:

  • VM provisioning tools (such as Vagrant with all possible providers) which use Salt to configure new instance won't detect failure.
  • Any Jenkins (or other CI platform) jobs which apply configuration or run other Salt jobs won't detect failure.

relates to #2417.

Hi,

+1 @uvsmtid for any jenkins jobs, and anything using state.orchestrate

I've just revisited this place due to related problems and remembered that I wrote an ad-hoc script to sniff failures from Salt output as part of Salt-based automation project. The script, however, is independent/single/standalone Python file. Some additional details in this SO answer.

@uvsmtid, do you know if there are other github issues that document the issues you mention in that stackoverflow answer? I have some ideas to automate plugin conformation with unit and possibly integration testing. This is a big project that I've been wanting to do for a while and need to set aside time for it, but first I need to collect the scattered wisdom and user experience on the details for which no clear convention has been either implemented or documented. I think most users would agree that a consistent and a standard plugin behavior is much more important than a favorite plugin behavior.

@uvsmtid, also my stackoverflow credit is too low to comment there, but if you want to parse the output that is returned by salt, you may have more success writing a custom returner that ensures a valid json data structure is outputted. I know there is at least one open issue, which I am having trouble finding. related to this for the default returner, which prioritizes real time info over producing valid json since you don't know when or if minions will return before the timeout.

@jfindlay, I din't look up the output issues (1: more than one JSON object, 2: mixed STDOUT of JSON and non-JSON) - it was directly from my observation on the spot. The custom returner may be better, but I haven't written any so far - so, it's to be looked into.

We can split concerns at this point: (A) exit codes and (B) output format/convention/protocol.
If you create feature-issue regarding "automate plugin conformation", I'll help to find or create non-duplicate bug-issues and link them under your solution umbrella to track.

@uvsmtid, thanks for the feedback. The stdout issues are going to have to be addressed in the returner/outputter context and you are welcome to open an issue about that because I fail at finding the other issue(s). I've dumped all my ideas about exec and state modules in #25142 and #25143, respectively.

+1

A big reason for us selecting a tool like salt is to orchestrate deployments.

@uvsmtid - thank you for your contributions to this issue

et moi +1

@jfindlay so 2015.8 is which code name?

2015.8.0 is codenamed Beryllium.

Boron is tentatively targeted for 2015.11.0

I'm amazed this was originally labelled as low priority!!!

+1

@rgardam welcome to the party :) ...nothing should surprise i'd say

Where is the proper place to cheer for this issue? Just integrated a runner into my Buildbot workflow, but would need to do a jobs.lookup_jid on an ID parsed from the output of the original runner and in turn parse the YAML output of the lookup and examine the success key.

I'd much rather have $? ...

My actual issue is that I see a return code of 0 even when my runner explodes (like a Python exception occurs) which is not very useful.

@bmcorser, you've found the right place, cheer away.

+1

immediately stopping builds when test.ping returns $? -ne 0 would be just nice

+1

We're using Salt as part a continuous delivery workflow, and it's really problematic that when running salt '*' state.highstate on the master, when states fail in minions for some reason (with Result: False in the logs), the salt command still exits with 0. We'll have to resort to parsing log files to find out if the one thing that Salt does in that workflow (configuration management) actually succeeded or not.

+1 exit code standard is required. Running state.* vs file.file_exists (or test.ping etc) should be treated differently. And also target should also have an impact e.g. '' vs real hostname or set of hostnames. e.g. salt '' state.highstate (soft error) vs salt 'hostname' state.highstate (hard error)
There is sort of a standard.. http://tldp.org/LDP/abs/html/exitcodes.html
If something does not go to plan it should indicate to look at the logs with an exit code.

May be an option to indicate a person is calling it as part of a script, or set of options which indicates what the executer cares about. e.g. missing hosts, or failed to run on a host, or does not care if it does not run on a host.

May be a bit based exit code for apply state, highstate

 0 - All as expected, no need to review the output
 1 - Not All Targets Reachable
 2 - Issues Applying states
 4 -
 8 - 
16 -
32 - (last one) as exit code 127 is not found

1+2= exit code 3

+1

The whole CI+Salt setup is insanely powerful, but somewhat limited by this.

It is also important to expose this to the salt-api.
Here is the output from a 2015.8.8 salt command invocation of cmd.run 'false 1':

# salt 'web1' cmd.run 'false 1'
web1:
ERROR: Minions returned with non-zero exit code

While the API does not include the necessary info when sending to the / endpoint:

{
    "return": [
        {
            "web1": ""
        }
    ]
}

When you lookup the job via /jobs/JID you get slightly more info, but still nothing useful for determining exit status.

{
    "info": [
        {
            "Arguments": [
                "false 1"
            ],
            "Function": "cmd.run",
            "Minions": [
                "web1"
            ],
            "Result": {
                "web1": {
                    "return": ""
                }
            },
            "StartTime": "2016, Apr 11 15:59:42.643646",
            "Target": "web1",
            "Target-type": "glob",
            "User": "jenkins",
            "jid": "20160411155942643646"
        }
    ],
    "return": [
        {
            "web1": ""
        }
    ]
}

and this same issue occurs with the service execution module. Here is an example of a service failing to restart:

{
    "info": [
        {
            "Arguments": [
                "klsdklsd"
            ],
            "Function": "service.restart",
            "Minions": [
                "web1"
            ],
            "Result": {
                "web1": {
                    "return": false
                }
            },
            "StartTime": "2016, May 27 14:08:28.970138",
            "Target": "web1",
            "Target-type": "glob",
            "User": "jenkins",
            "jid": "20160527140828970138"
        }
    ],
    "return": [
        {
            "web1": false
        }
    ]
}

+1
How come there is still no way of knowing the exit status of a remote command (cmd.run) ?
salt-run jobs.print_job [jid] doesn't return anything status-related... :(

This not being addressed as top priority is a pure negligence. How do you think salt* usage would be scripted without this bug being resolved? How do you test your executables if your exit codes don't have a meaning??

+1
This issue has caused great pain in our normal deployment process. Actually it has caused a few production live site issues already. We have to apply other ad-hoc detection of the deployment result which is really unnecessary. For state.highstate at least we have succeeded and failed summary, but for others such as state.apply or cmd.run we have to review all the output and check if the command really succeeds or not.

I can't believe this is addressed as low priority. It is opened 2 years ago and still open.

This is really serious compliance issue. How I can make sure without parsing output that all my states applied correctly or will applied correctly?

+1

:+1:

@blacked We are using serverspec on every salt run to make sure that the states are actually applied correctly.

@meggiebot Please include me when this is added to a sprint for Carbon. I would like to coordinate these changes with salt-api and Salt Enterprise so that all three are on the same page going forward.

+1

Isn't this what --retcode-passthrough is supposed to do?

I could be mistaken but I believe this is what --retcode-passthrough is supposed to _expose_.

@whiteinge , somehow I'm just not understanding what the nuances are of your term "expose". I get the feeling I'm missing something.

My understanding is that when --retcode-passthrough is specified that the aggregate exit status of the command is returned rather than the status of the salt executable itself.

As far as compliance (@blacked), it would be reasonable to query the job cache to see if any of the retcodes were non-zero. (yes, I get that it's somewhat awkward)

retcode-passthrough is not generic for all salt tools:

[root@salt-master ~]# salt-cloud --retcode-passthrough
Usage: salt-cloud

salt-cloud: error: no such option: --retcode-passthrough

Right, --retcode-passthrough does this, and it does not really make a lot of sense for salt cloud, because salt-cloud should just five us good retcodes.
The real issue here is that we need to create a standard retcode definition and location, then go through the execution modules and normalize the retcodes to the standard

I would expect that there would be a range of retcodes for the specific tool and then another range for failures in anything that the tool manages. For example 0-127 might represent status for the specific tool (viz. salt) where subprocesses/managed devices would have an automatic passthrough of their retcode+128. That way if all you care about is pass/fail you get it without the --retcode-passthrough option. If you want to distinguish between a failure in the invoked tool vs. managed devices then you could test the range on the retcode. One thing to keep in mind is that status of 128+SIGNUM. Often times on POSIX exit status is only 8 bits rather than the full int - although newer calls that use waitid() have access to the full int (which is most software by now).

I have no idea what Windows does.

We could certainly compress to a narrow subrange just for minion retcode pass-through.

There is also an additional problem: several of the CLI tools have broken exit status handling where some of them _always_ exit with status 0 even when there is an error. I have been working through a set of fixes as well as integration tests for many of the CLI tools. Some of them are submitted and merged and others are waiting acceptance.

Examples:
* #33762
* 1001d2b9e7e630c005cb24d5c43211f5d47d1b11
* 48fc51cb7b39b2d02c5b4b7adcc904962180a54b

Similar problem with salt-call... exits with 0 even when there's a serious error, e.g. "Unable to render top file". --retcode-passthrough does not remediate this.

Best solution is to tee the output to a file, and then grep for error messages.

Give me a working example and I will see what I can do

ZD-935

I fixed about one of these bugs, #35964 .

"Unable to render top file" is working for me via salt-ssh, which calls salt-call internally. This is based on a pull request I did a while ago.

Version: 2016.3.2

salt-ssh -i 'foo.acme.com' state.highstate
[ERROR   ] Unable to render top file: while parsing a block mapping
  in "<unicode string>", line 1, column 1:
    base:
    ^
expected <block end>, but found '<block sequence start>'
  in "<unicode string>", line 4, column 5:
        - base
        ^
foo.acme.com:

Summary for foo.acme.com
-----------
Succeeded: 0
Failed:   0
-----------
Total states run:    0
[ngrennan@host _ssh_master_dev]$ echo $?
20

'*' vs '*': in top.sls

ZD-1502 & ZD-935

Wait, what is the current status of retcodes for salt operations? It's been 4 years since the issue was created...

Exit codes have been normalized in https://github.com/saltstack/salt/pull/48361 for the upcoming Fluorine release. The following conditions will set a nonzero exit code:

  • An exception is raised
  • The code being run sets __context__['retcode'] to a nonzero value
  • The return data for a given function is a dictionary with either a result or success key in it, which resolves as False

file.file_exists would not trigger any of those conditions, however. The problem here is if we use something like __context__['retcode'] in file.file_exists to denote failure, then we've marked the run as failed, so if any state invokesfile.file_exists, this would make salt reutrn a nonzero exit code even if the file not existing was expected (and therefore not an error).

I'm a bit confused about the status - I've just picked up salt 2018.3.4 which has the latest fixes in for return codes for salt-call. If I pass the argument --retcode-passthrough to salt-call I now get non-zero exit codes for failures but without it I always get 0 - is this the expected functionality at the current time? Anybody know when this functionality becomes the default?

I think this is expected, as far as I know. To get status code from operation executed by salt-call you have to use --retcode-passthrough if nothing was changed.

@oliver-dungey As explained above, the retcode changes are in the Fluorine release (2019.2.0).

@saltstack/team-core Since Fluorine was released, this can be closed.

Using salt version below, the issue with the incorrect exit code is still present when you add --batch-size and --batch-wait parameters

Salt Version:
           Salt: 2019.2.0

Dependency Versions:
           cffi: 1.6.0
       cherrypy: unknown
       dateutil: Not Installed
      docker-py: Not Installed
          gitdb: 0.6.4
      gitpython: 1.0.1
          ioflo: Not Installed
         Jinja2: 2.7.2
        libgit2: 0.26.3
        libnacl: Not Installed
       M2Crypto: 0.31.0
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.5.6
   mysql-python: Not Installed
      pycparser: 2.14
       pycrypto: 2.6.1
   pycryptodome: 3.7.3
         pygit2: 0.26.4
         Python: 2.7.5 (default, Apr  9 2019, 14:30:50)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 15.3.0
           RAET: Not Installed
          smmap: 0.9.0
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.1.4

System Versions:
           dist: centos 7.6.1810 Core
         locale: UTF-8
        machine: x86_64
        release: 3.10.0-957.5.1.el7.x86_64
         system: Linux
        version: CentOS Linux 7.6.1810 Core

sudo salt -v -L minion-1,minion-2 --batch-size 50% --batch-wait 1 state.apply queue=True test-state

Even though there is an error:

jid:
    20190613124535690469
retcode:
    1
gc-euw1-salt-1:
    Data failed to compile:
----------
    Rendering SLS 'base:test-state' failed: Jinja syntax error: expected token 'end of statement block', got 'string'; line 1

salt exit status is 0.

echo $?
0

Please open a new issue. Commenting on a closed issue doesn't do much good.

Was this page helpful?
0 / 5 - 0 ratings