Sometimes, when stopping a Nomad job that uses the Vault Consul secrets engine to get an ACL token, the service fails to deregister from Consul with the error messages below. Note the empty accessorID.
2020-05-18T13:13:40.600Z [WARN] agent: Service deregistration blocked by ACLs: service=api-e827a3f70da4-9997 accessorID=
2020-05-18T13:13:40.601Z [WARN] agent: Check deregistration blocked by ACLs: check=service:api-e827a3f70da4-9997:2 accessorID=
2020-05-18T13:13:40.602Z [WARN] agent: Check deregistration blocked by ACLs: check=api-e827a3f70da4-9997-ttl accessorID=
In my case the job is the fabio proxy which self-registers the ttl health check (i.e. not done by Nomad as far as I know).
Removing the service manually with consul service deregister, or via hashi-ui (which uses the same REST API), does not help. (It seems to be specifically the TTL checks registered by both fabio and rabbitmq that end up as zombies for some reason.) The only way to get rid of the zombie service and the error message in the Consul logs is to run consul leave on each of our server nodes and then restart Consul.
Running versions:
This issue is possibly linked to #7669
A blank accessorID indicates that either the request was made with the anonymous token (changed in 1.8.0 to output the accessor ID of the anonymous token) or that the token used for the registration has been deleted.
My guess here is that Nomad deleted the Consul token at about the same time as deregistering the service with the Consul agent. Then when the Consul agent went to fully remove the service from the Catalog the token was no longer valid for use.
Thinking out loud a bit here, but I am wondering if Consul should unconditionally perform deregistrations during anti-entropy with the agent token instead of the token used to register/deregister the service with the local agent?
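To make the suspected race concrete, here is a toy model in Python. It is only a sketch of the idea and not Consul's actual code: ToyServer, anti_entropy_deregister, run, and the token strings are all hypothetical names. It also assumes, for simplicity, that the agent token is allowed to deregister any service on the node.

```python
from dataclasses import dataclass, field


@dataclass
class ToyServer:
    """Hypothetical stand-in for the Consul servers: known tokens plus the catalog."""
    valid_tokens: set = field(default_factory=set)
    catalog: dict = field(default_factory=dict)  # service ID -> token used to register

    def deregister(self, service_id: str, token: str) -> bool:
        # A deleted (unknown) token is rejected, mirroring "blocked by ACLs".
        if token not in self.valid_tokens:
            return False
        self.catalog.pop(service_id, None)
        return True


def anti_entropy_deregister(server, service_id, service_token, agent_token,
                            always_use_agent_token):
    """One sync pass that removes a locally deregistered service from the catalog."""
    token = agent_token if always_use_agent_token else service_token
    return server.deregister(service_id, token)


def run(always_use_agent_token: bool) -> bool:
    """Return True if the service is still in the catalog afterwards (a zombie)."""
    server = ToyServer(valid_tokens={"agent-token", "vault-issued-token"})
    server.catalog["api-1"] = "vault-issued-token"
    # The scheduler revokes the short-lived token before the sync pass runs.
    server.valid_tokens.discard("vault-issued-token")
    anti_entropy_deregister(server, "api-1", "vault-issued-token",
                            "agent-token", always_use_agent_token)
    return "api-1" in server.catalog


print(run(always_use_agent_token=False))  # True: the zombie entry remains
print(run(always_use_agent_token=True))   # False: the catalog entry is removed
```

In this model the zombie only appears when the sync pass reuses the already-revoked registration token, which is the ordering described above.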
Your suggestion that Nomad is clearing out the ACL token before Consul has had the chance to use it to deregister the service seems quite plausible. I assume the anonymous token isn't used at all in this case, since Nomad fetches a Consul token with specific policies from Vault when running the job.
On initial thought, using the agent token to force removal of services during anti-entropy seems like a good idea. I don't know of any drawbacks to that approach, except that you're giving Consul higher permissions to deregister any service by using the agent token in this situation?
For us this is becoming a pretty big problem because it occurs a lot, several times a day if we're trying out things in the cluster (starting and stopping jobs, testing shutdown of nodes, etc.). Having these zombie services remain is really annoying, and the workaround of restarting the servers one by one to clear out the service and its health checks is iffy at best.
We had this in the past and still sometimes have this issue (though very infrequently since we took measures to avoid it).
Our original main issue was an agent being reinstalled and thereby cleaned of all its services, hence the patch we made: https://github.com/hashicorp/consul/pull/5217
With this PR, the server now accepts the deregistration if it is performed with the agent's own token (since we can consider the agent authoritative regarding the services it is not hosting; for registration, it is different). So, if the agent accepts the deregistration but gets denied access during anti-entropy, it should probably retry with its node agent token to effectively remove the service. From a security perspective, I think this makes sense, as this token is already enough to make the agent leave and thus remove all its services at once.
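A rough Python sketch of the retry idea, just to illustrate the control flow. This is not how Consul implements it; AclDenied, catalog_deregister, sync_deregister, and the token strings are hypothetical names, and the model assumes the agent token is permitted to deregister services on its own node.

```python
class AclDenied(Exception):
    """Raised by the toy catalog when a token is unknown or has been deleted."""


def catalog_deregister(catalog: set, valid_tokens: set, service_id: str, token: str) -> None:
    # Toy ACL check: only tokens that still exist may modify the catalog.
    if token not in valid_tokens:
        raise AclDenied(f"deregistration of {service_id} blocked by ACLs")
    catalog.discard(service_id)


def sync_deregister(catalog: set, valid_tokens: set, service_id: str,
                    service_token: str, agent_token: str) -> str:
    """Try the service's own token first; on ACL denial, retry once with the agent token.

    Returns which token succeeded. A failure of the retry is not swallowed.
    """
    try:
        catalog_deregister(catalog, valid_tokens, service_id, service_token)
        return "service"
    except AclDenied:
        catalog_deregister(catalog, valid_tokens, service_id, agent_token)
        return "agent"


# The service token was removed while the service was still registered.
catalog = {"api-1"}
used = sync_deregister(catalog, {"agent-token"}, "api-1", "old-token", "agent-token")
print(used, catalog)
```

Compared with always using the agent token, this keeps the normal path on the least-privileged token and only falls back once that token has been denied.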
We had this issue yesterday for a very long-running service whose ACL token had been removed (but the case of temporary ACL tokens definitely makes sense too, of course).
@mkeeler I also have this issue. From the documentation (https://www.consul.io/docs/agent/checks):
Checks may also contain a token field to provide an ACL token. This token is used for any interaction with the catalog for the check, including anti-entropy syncs and deregistration.
So the Consul agent caches the Consul token used for check registration. I periodically rotate Consul tokens for Nomad clients, and hence a very long-running service whose ACL token has been removed can't be deregistered from the catalog. I agree with @pierresouchay that it makes sense to retry with the node's agent token.