Azure-cli: az <resource> show has inconsistent exit status

Created on 30 Nov 2017 · 9Comments · Source: Azure/azure-cli

Following template scripts generated from the Portal's Automation Script section, the CLI script uses checks for existing resources such as

az group show $resourceGroupName 1> /dev/null

if [ $? != 0 ]; then
    echo "Resource group with name" $resourceGroupName "could not be found. Creating new resource group.."
    set -e
    (
        set -x
        az group create --name $resourceGroupName --location $resourceGroupLocation 1> /dev/null
    )
    else
    echo "Using existing resource group..."
fi

az group show and az storage account show both return non-zero when the resource is missing and so the above pattern works. If being consistent across resources throughout my scripts, examples such as az storage share show and az vm show return zero regardless of the resource existing or not. Thus cannot be used to determine missing resources.

The issue is that the 'show' action doesn't always return a non-zero code for missing resources when used on other resource types. For consistency and approachability to new commands, I would thing these would behave the same across resources?

Environment summary

az --version 2.0.21
Python (Darwin) 3.6.3
Installed via homebrew

ARM Resource Manager question

Source

damienpontifex

All 9 comments

I am quite surprised to see that this is open and not a priority. How are users expected to do the absolute most basic of tasks, such as, (as @damienpontifex demonstrates) ensuring that resources exist and creating them if they don't exist?

How is anyone doing "ensure" logic reliably? How many bugs have been introduced because of this silently stopping to work due to whatever release regressed this and people not realizing that a widespread convention is no longer followed? I ask because it appears that this problem has grown since @damienpontifex filed this originally -- storage account and storage container now behave incorrectly as well.

(edit: Was this announced as a breaking change? I'd argue that it is...)

Is it the CLI's official stance that folks need to be catching and parsing string output in order to script Azure through the official CLI?

Finally, assuming this is a bug and will be fixed, can there please be positive and negative tests added for all resource types to ensure this doesn't happen again?

cc: @tjprescott

colemickens on 24 Apr 2018

Some more research. It looks like this was explicitly intentional:

Issue: https://github.com/Azure/azure-cli/issues/1952
PR "fixing" it: https://github.com/Azure/azure-cli/pull/2117

Here's a dupe where @devigned reported it and it was simply closed: https://github.com/Azure/azure-cli/issues/1845.

It appears that for some resource there is az [resource] exists but this doesn't even extend to the third or fourth resources I tried, so it's clearly not a reliable or consistent method either. Still holding out hope for something better than checking for an empty or whitespace-only string.

colemickens on 24 Apr 2018

A few things:

We don't control the portal's script generation script. If they are using non-zero error code as exists then it is a bug in their generation script
As you pointed out, it is intentional that we don't raise an error for this situation. You can chat with @johanste on the philosophy of it. :)
The commands which have exists are because those services have that endpoint and the command would return what the service does. What do you mean when you say it doesn't extend to the third or fourth resources you tried?

tjprescott on 25 Apr 2018

Two reasons for the original change:

1) Differentiate between:
a) The command successfully completed, but there was no matching resource.
b) The command failed for some reason - we don't know if there is a resource.
c) The command successfully completed, and there is a resource.

2) Enable users to use set -e in their bash scripts.

There should not be any parsing of the output needed (unless checking for an empty string constitutes parsing).

The only time where a command should return a failure code on exists is if an empty string does not uniquely indicate that the resource doesn't exists (consider a blob content show like command where it would be impossible to distinguish between an existing blob with empty content and a non-existent blob. If this is not happening, please open a bug.

@colemickens, are you advocating for having a universal "exists" command? The only other alternative I see is to have different error codes for "doesn't exist" and "failed to determine if it exists" - with the downside of this being a breaking change (again) as well as breaking the ability to use set -e in many cases...

johanste on 25 Apr 2018

However, as the title points out, it is true that not all show commands follow the empty_on_404 convention--however, it is what we promote.

tjprescott on 25 Apr 2018

I am surprised and disappointed that we sort of seem to be bike-shedding about rather basic functionality of a CLI, and that apparently, there isn't a prescribed, documented, and tested way of implementing "ensure resource" logic for Azure with the official CLI.

To be specific:

The current behavior doesn't follow conventions of other CLI tools. Docker, Kubectl, Gcloud, AWS CLI, none of them behave this way. It would, in my opinion, be far more appropriate to return a non-zero exit code and a string message to standard error.

Docker:

$ docker stats thisdoesnotexist; echo $?
Error response from daemon: No such container: thisdoesnotexist
1

AWS:

$ aws ec2 describe-instance-status --instance-ids="i-0598c7d356eba48d7"; echo $?
An error occurred (InvalidInstanceID.NotFound) when calling the DescribeInstanceStatus operation: The instance ID 'i-0598c7d356eba48d7' does not exist
255

GCP:

$ gcloud compute instances describe notarealinstance --zone=us-west1-a; echo $?
ERROR: (gcloud.compute.instances.describe) Could not fetch resource:
 - The resource 'projects/t2hrowawa2/zones/us-west1-a/instances/notarealinstance' was not found

1

Kubectl:

$ k get pod notarealpod; echo $?
Error from server (NotFound): pods "notarealpod" not found
1

And then Azure:

$ az group show -n thisaintreal; echo $?
0

None of them behave the way the Azure CLI has been made to behave.

This flat out broke scripts that I had written before this change was made. I feel bad for anyone with real production deployments who thought they could upgrade the CLI and expect basic idempotent resource creation scripts to work consistently. (Maybe this was announced as a breaking change and I couldn't find it?)
(There's no Portal involved here anywhere....)
I don't know what set -e has to do with it. I have set -e in all of my bash scripts, including the ones authored when az behaved like other, similar, unix tools.

Some day, it would be nice to write this, and have the CLI just do the right things:

az group ensure -n group0 -l westus2
az network vnet ensure -n net0 -g group0
az network subnet ensure -n subnet0 -v net0 -g group0
...

In lieu of that, an exists subcommand for _all_ resource types would be sufficient.

But whether or not an ensure or explicit exists command is added, I think it's worth reverting this particular behavior. It's probably just as bad to break it back the other direction, but this was jarring and an unsightly additional to the pile of roadbumps to doing basic things in Azure as a Linux developer and user.

Whatever the answer is, please, please, document it and write an integration test so that I don't have to revisit my most basic provisioning scripts in another year.

And yeah, I guess if you're going to keep the existing behavior, it ought to be consistently... blank and exit 0... so that users don't accidentally assume they can rely on exit codes.

colemickens on 28 Apr 2018

I read this and associated #6298, in short, 2 assumptions were involved leading to this change:

show is supposed to be used as the existence check before create.
all services could have consistencies on PUT, say, no error, fast, and support create if not exists, etc.

Both assumptions didn't end up well. The first one is not always true as existence check can be used for other purposes. The second one has never happened. Also some commands fell through the crack that still throw. Lining up this inconsistency with the unconventional behavior thus causes the confusion on end users.
In hindsight, we should not have done this change, or at least wait till the 2 assumptions got confirmed.

For how to address this issue, because the change has already been made and we have way more commands with this new behavior which makes the revert not a feasible option. So, let us just fix the rest commands to at least have the consistency. It sucks though to have to tweak such popular commands like az storage account show in a breaking change manner.