Salt: RFC: Standard for determining if a remote-execution function failed

Created on 18 May 2018  路  9Comments  路  Source: saltstack/salt

Problem

While Salt does exit with a non-zero status when states fail, it does not have a reliable way of reporting that a remote-execution function failed. It is not feasible to determine this simply by a boolean evaluation of the function's return data, as many functions are informational in nature (e.g. service.running, pkg.version, etc.) and a return value that evaluates as False doesn't necessarily mean that the function failed.

Proposed Solution

The minion is already setting the retcode for states based on __context__['retcode']. In minion.py, this value is checked here and the job's return code is set accordingly.

Therefore, the following is proposed to standardize return code handling for remote-execution functions:

  1. In this exception handling code, set ret['retcode'] to a non-zero integer value in each except block. This will get us a non-zero exit status for free, for any remote-execution functions which raise an exception, and we already have a great deal of exception raising already built into Salt code for various error conditions.

  2. Modify existing code to raise exceptions when the result of a function is a failure. NOTE: This will not be done. The PR opened to address this will rely on either __context__['retcode'] or for the return data to be a dictionary with either a result or success key, in cases where no exception was raised.

    Note that there are some remote execution functions that simply return the result of a cmd.run_all, or even a state return dictionary (i.e. one with name, result, changes, and comment keys).

    For cases where a cmd.run_all dictionary is returned, we can simply raise a CommandExecutionError('some error msg', info=cmd_result), where cmd_result is the name of the dict we'd normally be returning. This may require that we do exception catching in any state which invokes that remote-execution function. When doing so, we would be able to obtain the original cmd.run_all dict from exception's info attribute:

    try:
       result = __salt__['some_mod.some_func'](name, foo, bar, **kwargs)
    except CommandExecutionError as exc:
       # do some stuff here to gracefully return a False result for the state.
       # exc.info will contain the original cmd.run_all return dict
    

    For cases where the remote-execution function returns a state-like return dict, we could either take the same tactic as above, or we could simply set __context__['retcode'] before returning.

Pending Discussion ZRELEASED - Fluorine ZRETIRED - RFC

Most helpful comment

I am totally all for this. Also False doesn't mean "failed" (it also could mean "minion down", or worse "I did not got the response from the minion"). I would use this structure all around everywhere.

The only concern here is that not all system calls returns what it means. Say, zypper has own exit codes https://en.opensuse.org/SDB:Zypper_manual_(plain) and not always non-zero means failure.

All 9 comments

ping @saltstack/team-suse

the exception=non-zero ret plan seems sane and reasonable to me, but for cmd.run* based interfaces, could that be considered as the standard interface?
so, 4 official ways to handle inducing a failure in an exec mod:

  • return False
  • if ret is a dict with a a state-like structure, ie {'retcode':False, 'content': {...}}
  • set __context__['retcode'] = False
  • treat CommandExecutionError as a failure

__context__ is fine for adding a hint to the state engine for a func that already has a True/False based interface, but it's kind of clunky if that is the standard way of communicating a hint to the state engine of a failure, given it requires poking at a global magic context dict to trigger some action at a distance.
In my opinion canonicalizing {'retcode':False} is a less surprising way for salt to act for 99% of salt exec mods that need to do something like this. And if exceptions are your fancy as an end-user CommandExecError can induce the retcode as well.

  1. it would make sense to me to standardize around a dict. for me, this feels natural. i also think that allows room to grow.
  2. regardless of what is decided, documentation has to be a priority. this includes previous examples. confusion leads to frustration which is never good for an open sourced project.

my 2 cents :)

I am totally all for this. Also False doesn't mean "failed" (it also could mean "minion down", or worse "I did not got the response from the minion"). I would use this structure all around everywhere.

The only concern here is that not all system calls returns what it means. Say, zypper has own exit codes https://en.opensuse.org/SDB:Zypper_manual_(plain) and not always non-zero means failure.

@austinpapp Of course, whatever is decided would be documented.

However, standardizing around a dict for returns doesn't sound like a good idea to me. I like how many functions can return a simple True/False/etc. return, and I don't see how this would be improved by making all returns dictionaries. If we are talking about simply knowing whether or not something failed, raising an exception seems more Pythonic to me.

I've opened https://github.com/saltstack/salt/pull/48361 to implement the retcode handling discussed here. Per some additional discussion with @mattp- I added logic to check the the return value for the function to see if it is a dict with either a return or success key, and use those to determine success or failure.

The body of the PR message contains some further points of discussion, I would love to get feedback on the issues raised there.

Consider what I did with windows pkg.refresh_db failhard=False --out json which was to output result as data even if it failed and the user is responsible for checking the data return with some other program. The problem with raise CommandExecutionError etc is the output is no longer a valid data structure.

    if error_count > 0 and failhard:
        raise CommandExecutionError(
            'Error occurred while generating repo db',
            info=results
        )
    else:
        return results   # user needs to check the data returned for the error information.

$ salt -G os:Windows pkg.refresh_db failhard=False --out=json 2> /dev/null
{
    "testminion": {
        "failed": 2,
        "total": 242,
        "failed_list": {
            "dvscheduler2.sls": [
                "package 'dvschedulerX', version number 1 is not a string",
                "package 'dvschedulerX', version number 2 is not a string"
            ],
            "bad-file.sls": [
                "Compiled contents, not a dictionary/hash "
            ]
        },
        "success": 240
    }
}
{
    "downminion": "Minion did not return. [Not connected]"
}
$ salt -G os:Windows pkg.refresh_db failhard=True --out=json 2> /dev/null
{
    "testminion": "ERROR: Error occurred while generating repo db. Additional info follows:\n\nfailed:\n    2\nfailed_list:\n    ----------\n    bad-file.sls:\n        - Compiled contents, not a dictionary/hash \n    dvscheduler2.sls:\n        - package 'dvschedulerX', version number 1 is not a string\n        - package 'dvschedulerX', version number 2 is not a string\nsuccess:\n    240\ntotal:\n    242"
}
{
    "downminion": "Minion did not return. [Not connected]"
}

An alternative might be to change the structure of the output e.g.

salt \* --errors-as-data   state.apply stuff.sls --out=json
{
minion1:
    data:   <normal output>
    failed_list: <list> from python raise
    failed: <int>
}
{
minion2:
    data:   <normal output>
    failed_list: <list>  from python raise
    failed: <int>
}
{
--minion-summary--:
    success_minion_total:  <int> Add 1 if failed == 0 for each minion.
    failed_minion_responded: <int > Add 1 if failed > 0 for each minion.
    failed_minion_no_response: <int> Add 1 if minion did not respond.
    failed_minion_total: <int> Total of the above two.
    failed_states_total: <int> Total of all failed from each minion
    total_minion: <int>

}

There are no plans to modify existing code to raise an exception, this was an initial suggestion. I've updated the initial post to reflect this.

Thanks for the discussion on this everyone! Since #48361 has been submitted, I am going to close this issue.

Was this page helpful?
0 / 5 - 0 ratings