While Salt does exit with a non-zero status when states fail, it does not have a reliable way of reporting that a remote-execution function failed. It is not feasible to determine this simply by a boolean evaluation of the function's return data, as many functions are informational in nature (e.g. service.running
, pkg.version
, etc.) and a return value that evaluates as False
doesn't necessarily mean that the function failed.
The minion is already setting the retcode for states based on __context__['retcode']
. In minion.py, this value is checked here and the job's return code is set accordingly.
Therefore, the following is proposed to standardize return code handling for remote-execution functions:
In this exception handling code, set ret['retcode']
to a non-zero integer value in each except
block. This will get us a non-zero exit status for free, for any remote-execution functions which raise an exception, and we already have a great deal of exception raising already built into Salt code for various error conditions.
Modify existing code to raise exceptions when the result of a function is a failure. NOTE: This will not be done. The PR opened to address this will rely on either __context__['retcode']
or for the return data to be a dictionary with either a result
or success
key, in cases where no exception was raised.
Note that there are some remote execution functions that simply return the result of a cmd.run_all
, or even a state return dictionary (i.e. one with name
, result
, changes
, and comment
keys).
For cases where a cmd.run_all
dictionary is returned, we can simply raise a CommandExecutionError('some error msg', info=cmd_result)
, where cmd_result
is the name of the dict we'd normally be returning. This may require that we do exception catching in any state which invokes that remote-execution function. When doing so, we would be able to obtain the original cmd.run_all
dict from exception's info
attribute:
try:
result = __salt__['some_mod.some_func'](name, foo, bar, **kwargs)
except CommandExecutionError as exc:
# do some stuff here to gracefully return a False result for the state.
# exc.info will contain the original cmd.run_all return dict
For cases where the remote-execution function returns a state-like return dict, we could either take the same tactic as above, or we could simply set __context__['retcode']
before returning.
ping @saltstack/team-suse
the exception=non-zero ret plan seems sane and reasonable to me, but for cmd.run* based interfaces, could that be considered as the standard interface?
so, 4 official ways to handle inducing a failure in an exec mod:
{'retcode':False, 'content': {...}}
__context__['retcode'] = False
__context__
is fine for adding a hint to the state engine for a func that already has a True/False based interface, but it's kind of clunky if that is the standard way of communicating a hint to the state engine of a failure, given it requires poking at a global magic context dict to trigger some action at a distance.
In my opinion canonicalizing {'retcode':False}
is a less surprising way for salt to act for 99% of salt exec mods that need to do something like this. And if exceptions are your fancy as an end-user CommandExecError can induce the retcode as well.
my 2 cents :)
I am totally all for this. Also False
doesn't mean "failed" (it also could mean "minion down", or worse "I did not got the response from the minion"). I would use this structure all around everywhere.
The only concern here is that not all system calls returns what it means. Say, zypper has own exit codes https://en.opensuse.org/SDB:Zypper_manual_(plain) and not always non-zero means failure.
@austinpapp Of course, whatever is decided would be documented.
However, standardizing around a dict for returns doesn't sound like a good idea to me. I like how many functions can return a simple True/False/etc. return, and I don't see how this would be improved by making all returns dictionaries. If we are talking about simply knowing whether or not something failed, raising an exception seems more Pythonic to me.
I've opened https://github.com/saltstack/salt/pull/48361 to implement the retcode handling discussed here. Per some additional discussion with @mattp- I added logic to check the the return value for the function to see if it is a dict with either a return
or success
key, and use those to determine success or failure.
The body of the PR message contains some further points of discussion, I would love to get feedback on the issues raised there.
Consider what I did with windows pkg.refresh_db failhard=False --out json
which was to output result as data even if it failed and the user is responsible for checking the data return with some other program. The problem with raise CommandExecutionError etc is the output is no longer a valid data structure.
if error_count > 0 and failhard:
raise CommandExecutionError(
'Error occurred while generating repo db',
info=results
)
else:
return results # user needs to check the data returned for the error information.
$ salt -G os:Windows pkg.refresh_db failhard=False --out=json 2> /dev/null
{
"testminion": {
"failed": 2,
"total": 242,
"failed_list": {
"dvscheduler2.sls": [
"package 'dvschedulerX', version number 1 is not a string",
"package 'dvschedulerX', version number 2 is not a string"
],
"bad-file.sls": [
"Compiled contents, not a dictionary/hash "
]
},
"success": 240
}
}
{
"downminion": "Minion did not return. [Not connected]"
}
$ salt -G os:Windows pkg.refresh_db failhard=True --out=json 2> /dev/null
{
"testminion": "ERROR: Error occurred while generating repo db. Additional info follows:\n\nfailed:\n 2\nfailed_list:\n ----------\n bad-file.sls:\n - Compiled contents, not a dictionary/hash \n dvscheduler2.sls:\n - package 'dvschedulerX', version number 1 is not a string\n - package 'dvschedulerX', version number 2 is not a string\nsuccess:\n 240\ntotal:\n 242"
}
{
"downminion": "Minion did not return. [Not connected]"
}
An alternative might be to change the structure of the output e.g.
salt \* --errors-as-data state.apply stuff.sls --out=json
{
minion1:
data: <normal output>
failed_list: <list> from python raise
failed: <int>
}
{
minion2:
data: <normal output>
failed_list: <list> from python raise
failed: <int>
}
{
--minion-summary--:
success_minion_total: <int> Add 1 if failed == 0 for each minion.
failed_minion_responded: <int > Add 1 if failed > 0 for each minion.
failed_minion_no_response: <int> Add 1 if minion did not respond.
failed_minion_total: <int> Total of the above two.
failed_states_total: <int> Total of all failed from each minion
total_minion: <int>
}
There are no plans to modify existing code to raise an exception, this was an initial suggestion. I've updated the initial post to reflect this.
Thanks for the discussion on this everyone! Since #48361 has been submitted, I am going to close this issue.
Most helpful comment
I am totally all for this. Also
False
doesn't mean "failed" (it also could mean "minion down", or worse "I did not got the response from the minion"). I would use this structure all around everywhere.The only concern here is that not all system calls returns what it means. Say, zypper has own exit codes https://en.opensuse.org/SDB:Zypper_manual_(plain) and not always non-zero means failure.