Following template scripts generated from the Portal's Automation Script section, the CLI script uses checks for existing resources such as
az group show $resourceGroupName 1> /dev/null
if [ $? != 0 ]; then
echo "Resource group with name" $resourceGroupName "could not be found. Creating new resource group.."
set -e
(
set -x
az group create --name $resourceGroupName --location $resourceGroupLocation 1> /dev/null
)
else
echo "Using existing resource group..."
fi
az group show and az storage account show both return non-zero when the resource is missing and so the above pattern works. If being consistent across resources throughout my scripts, examples such as az storage share show and az vm show return zero regardless of the resource existing or not. Thus cannot be used to determine missing resources.
The issue is that the 'show' action doesn't always return a non-zero code for missing resources when used on other resource types. For consistency and approachability to new commands, I would thing these would behave the same across resources?
az --version 2.0.21
Python (Darwin) 3.6.3
Installed via homebrew
I am quite surprised to see that this is open and not a priority. How are users expected to do the absolute most basic of tasks, such as, (as @damienpontifex demonstrates) ensuring that resources exist and creating them if they don't exist?
How is anyone doing "ensure" logic reliably? How many bugs have been introduced because of this silently stopping to work due to whatever release regressed this and people not realizing that a widespread convention is no longer followed? I ask because it appears that this problem has grown since @damienpontifex filed this originally -- storage account and storage container now behave incorrectly as well.
(edit: Was this announced as a breaking change? I'd argue that it is...)
Is it the CLI's official stance that folks need to be catching and parsing string output in order to script Azure through the official CLI?
Finally, assuming this is a bug and will be fixed, can there please be positive and negative tests added for all resource types to ensure this doesn't happen again?
cc: @tjprescott
Some more research. It looks like this was explicitly intentional:
Here's a dupe where @devigned reported it and it was simply closed: https://github.com/Azure/azure-cli/issues/1845.
It appears that for some resource there is az [resource] exists but this doesn't even extend to the third or fourth resources I tried, so it's clearly not a reliable or consistent method either. Still holding out hope for something better than checking for an empty or whitespace-only string.
A few things:
exists are because those services have that endpoint and the command would return what the service does. What do you mean when you say it doesn't extend to the third or fourth resources you tried?Two reasons for the original change:
1) Differentiate between:
a) The command successfully completed, but there was no matching resource.
b) The command failed for some reason - we don't know if there is a resource.
c) The command successfully completed, and there is a resource.
2) Enable users to use set -e in their bash scripts.
There should not be any parsing of the output needed (unless checking for an empty string constitutes parsing).
The only time where a command should return a failure code on exists is if an empty string does not uniquely indicate that the resource doesn't exists (consider a blob content show like command where it would be impossible to distinguish between an existing blob with empty content and a non-existent blob. If this is not happening, please open a bug.
@colemickens, are you advocating for having a universal "exists" command? The only other alternative I see is to have different error codes for "doesn't exist" and "failed to determine if it exists" - with the downside of this being a breaking change (again) as well as breaking the ability to use set -e in many cases...
However, as the title points out, it is true that not all show commands follow the empty_on_404 convention--however, it is what we promote.
I am surprised and disappointed that we sort of seem to be bike-shedding about rather basic functionality of a CLI, and that apparently, there isn't a prescribed, documented, and tested way of implementing "ensure resource" logic for Azure with the official CLI.
To be specific:
The current behavior doesn't follow conventions of other CLI tools. Docker, Kubectl, Gcloud, AWS CLI, none of them behave this way. It would, in my opinion, be far more appropriate to return a non-zero exit code and a string message to standard error.
Docker:
$ docker stats thisdoesnotexist; echo $?
Error response from daemon: No such container: thisdoesnotexist
1
AWS:
$ aws ec2 describe-instance-status --instance-ids="i-0598c7d356eba48d7"; echo $?
An error occurred (InvalidInstanceID.NotFound) when calling the DescribeInstanceStatus operation: The instance ID 'i-0598c7d356eba48d7' does not exist
255
GCP:
$ gcloud compute instances describe notarealinstance --zone=us-west1-a; echo $?
ERROR: (gcloud.compute.instances.describe) Could not fetch resource:
- The resource 'projects/t2hrowawa2/zones/us-west1-a/instances/notarealinstance' was not found
1
Kubectl:
$ k get pod notarealpod; echo $?
Error from server (NotFound): pods "notarealpod" not found
1
And then Azure:
$ az group show -n thisaintreal; echo $?
0
None of them behave the way the Azure CLI has been made to behave.
This flat out broke scripts that I had written before this change was made. I feel bad for anyone with real production deployments who thought they could upgrade the CLI and expect basic idempotent resource creation scripts to work consistently. (Maybe this was announced as a breaking change and I couldn't find it?)
(There's no Portal involved here anywhere....)
I don't know what set -e has to do with it. I have set -e in all of my bash scripts, including the ones authored when az behaved like other, similar, unix tools.
Some day, it would be nice to write this, and have the CLI just do the right things:
az group ensure -n group0 -l westus2
az network vnet ensure -n net0 -g group0
az network subnet ensure -n subnet0 -v net0 -g group0
...
In lieu of that, an exists subcommand for _all_ resource types would be sufficient.
But whether or not an ensure or explicit exists command is added, I think it's worth reverting this particular behavior. It's probably just as bad to break it back the other direction, but this was jarring and an unsightly additional to the pile of roadbumps to doing basic things in Azure as a Linux developer and user.
Whatever the answer is, please, please, document it and write an integration test so that I don't have to revisit my most basic provisioning scripts in another year.
And yeah, I guess if you're going to keep the existing behavior, it ought to be consistently... blank and exit 0... so that users don't accidentally assume they can rely on exit codes.
I read this and associated #6298, in short, 2 assumptions were involved leading to this change:
show is supposed to be used as the existence check before create.Both assumptions didn't end up well. The first one is not always true as existence check can be used for other purposes. The second one has never happened. Also some commands fell through the crack that still throw. Lining up this inconsistency with the unconventional behavior thus causes the confusion on end users.
In hindsight, we should not have done this change, or at least wait till the 2 assumptions got confirmed.
For how to address this issue, because the change has already been made and we have way more commands with this new behavior which makes the revert not a feasible option. So, let us just fix the rest commands to at least have the consistency. It sucks though to have to tweak such popular commands like az storage account show in a breaking change manner.
Take a look at https://github.com/Azure/azure-cli/issues/6612 for updates to this issue and our decision to resolve it.
@williexu Thank you for cross-linking. I'm also quite happy with the decision. Cheers!