Service registration fails from any agent when ACLs are configured according to the Consul ACL docs. The docs read as though acl_agent_token is the token the agent uses for all agent-local sync operations; I've taken this to include things like service registration from .json files in, say, the -config-dir directory. However, even with the following policy attached to the token configured as acl_agent_token, service registration fails:
node "" {
policy = "write"
}
service "" {
policy = "write"
}
It's only when I set the token field in my service definition or if I set the acl_token field in the agent config that service registration succeeds. I don't see this called out anywhere in the ACL guide or service documentation, and based on how the acl_agent_token description reads, I would assume that local agent configurations (including static service definitions in config json files) should be using the acl_agent_token.
I would hope to find out if this is expected behavior for acl_agent_token or if the acl_agent_token should indeed be used in the registration of static service configs.
Steps to reproduce this issue, using the following agent config:
"acl_datacenter": "dc1",
"acl_down_policy": "extend-cache",
"acl_default_policy": "deny",
"acl_master_token": "mytoken",
"acl_agent_token": "myagenttoken",
"service": {
"name": "myservice",
"port": 9999
}
consul agent -bootstrap -data-dir=./data -config-dir=.
curl http://localhost:8500/v1/acl/update?token=mytoken -X PUT --data '{"ID": "myagenttoken", "Name": "Server token", "Type": "client", "Rules": "node \"\" { policy = \"write\" } service \"\" { policy = \"write\" }"}'
Result:
[WARN] agent: Service "myservice" registration blocked by ACLs
With "acl_token": "myagenttoken" in the agent config:
[INFO] agent: Synced service "myservice"
With "token": "myagenttoken" in the service config:
[INFO] agent: Synced service "myservice"
Client and Server info (they are the same for the purposes of this test)
agent:
check_monitors = 0
check_ttls = 0
checks = 0
services = 1
build:
prerelease =
revision = e716d1b5
version = 1.2.2
consul:
bootstrap = true
known_datacenters = 1
leader = true
leader_addr = scrubbed
server = true
raft:
applied_index = 15
commit_index = 15
fsm_pending = 0
last_contact = 0
last_log_index = 15
last_log_term = 4
last_snapshot_index = 0
last_snapshot_term = 0
latest_configuration = [{Suffrage:Voter ID:scrubbed Address:scrubbed}]
latest_configuration_index = 1
num_peers = 0
protocol_version = 3
protocol_version_max = 3
protocol_version_min = 0
snapshot_version_max = 1
snapshot_version_min = 0
state = Leader
term = 4
runtime:
arch = amd64
cpu_count = 2
goroutines = 75
max_procs = 2
os = linux
version = go1.10.1
serf_lan:
coordinate_resets = 0
encrypted = true
event_queue = 1
event_time = 4
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 1
members = 1
query_queue = 0
query_time = 1
serf_wan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 1
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 1
members = 1
query_queue = 0
query_time = 1
CentOS/RHEL 6 & 7 machines are affected in my testing. Other OSes weren't tested.
Please let me know if you would like logs for any or all of the above cases. They're not included here since the issue should be reproducible (I've run through it several times myself now).
Hey @dmarkwat thanks for the issue.
We certainly have a lot of complicated options around ACL tokens, and I think the current docs are correct but perhaps not super clear depending on how you read them.
What you describe is the intended behaviour.
From the docs (https://www.consul.io/docs/agent/options.html#acl_agent_token):
acl_agent_token - Used for clients and servers to perform internal operations. If this isn't specified, then the acl_token will be used. This was added in Consul 0.7.2.
This token must at least have write access to the node name it will register as in order to set any of the node-level information in the catalog such as metadata, or the node's tagged addresses. There are other places this token is used, please see ACL Agent Token for more details.
To put that another way, this is _only_ used for internal agent operations that are not explicitly initiated by an operator configuration or API call. In practice that means registering the _node_ i.e. having the agent join the cluster in the first place and maybe some other internal things like managing encryption key operations to make Gossip work (I might have made that up but it's a reasonable thing we might use this scoped token for).
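As a sketch of what that scoping implies (the node name "agent-one" is just a placeholder, not from this thread), a minimally scoped policy for acl_agent_token might look like:

```hcl
# Hypothetical minimal policy for acl_agent_token: just enough for the
# agent to register its own node entry in the catalog, nothing more.
# "agent-one" is a placeholder node name.
node "agent-one" {
  policy = "write"
}
```

A broad node "" { policy = "write" } rule like the one at the top of this issue also satisfies the node-registration requirement, just with a wider scope.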
acl_token is actually what you want in this case. That is described as (https://www.consul.io/docs/agent/options.html#acl_token):
acl_token - When provided, the agent will use this token when making requests to the Consul servers. Clients can override this token on a per-request basis by providing the "?token" query parameter. When not provided, the empty token, which maps to the 'anonymous' ACL policy, is used.
So in other words, a _service_ registration (whether it is initiated via API call or config file) will either need an explicit token in the API request or in the config file itself as you describe, otherwise it will default to using acl_token.
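Putting that together, a sketch of the two options, reusing the config from the report above (either one alone is enough; the token value is a placeholder):

```json
{
  "acl_token": "myagenttoken",

  "service": {
    "name": "myservice",
    "port": 9999,
    "token": "myagenttoken"
  }
}
```

The per-service "token" takes precedence for that service's registration; "acl_token" is the agent-wide fallback for requests that carry no token of their own.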
In general it's a known issue that the ACL docs are not easy to work with. We have a bunch of changes to ACLs on the cards that should help, which is why we are holding off a little on rewriting the docs immediately.
That said, if you can point out which specific wording confused you here that would be really useful data for us during that process.
Thanks for clearing that up! It occurred to me after I posted that, "oh, maybe they mean agent-_only_, not agent-_wide_". So that was my bad! Everything is working for me now, exactly as you described. As for where I got tripped up, I'd say it stems in large part from this passage in the guide.
The acl_agent_token is a special token that is used for an agent's internal operations. It isn't used directly for any user-initiated operations like the acl_token, though if the acl_agent_token isn't configured the acl_token will be used.
I may have taken that too literally and assumed that everything not user-initiated (from the command line/API) would be handled with the acl_agent_token. No other doc led me to believe that it _didn't_ cover various static local configs, such as service json files, and that may have been where it went off the rails for me. But at the very least, there's a clear issue out here now which would have helped me!
Also sorry for hitting close prematurely--meant to hit preview (oops).
@banks So Consul's ACL model regards service definitions in static config files as registrations by random strangers via HTTP API? That sounds very wrong to me, as wrong as requiring -config-file to be world-writable.
@nodakai This seemed wrong to me as well so I looked into the code and that is almost accurate. You could have a service def in the config file like:
"service": {
"name": "myservice",
"port": 9999,
"token": "072c2f14-7499-4d1f-8dec-7f2cae00f0e3"
}
This embeds the proper token for registering the service, so it won't appear as if a random stranger is attempting to register it.
With that being said, maybe for service definitions from the config file we should use the agent token or at the very least the default token.
I am reopening this to track an enhancement that will use the acl.tokens.default token value for service registrations originating from config files.
I think it really needs some clarification, see our PR which is a bit related : https://github.com/hashicorp/consul/pull/5217
Actually, services configured via the config files already use the acl.tokens.default token, which is the same thing that would happen if the default token were configured and a request to register a service came in with no token.
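In 1.4+ config syntax, that fallback would look something like the following sketch, where the default token's attached policy needs service write access for the services loaded from config files (the token value is a placeholder):

```json
{
  "acl": {
    "enabled": true,
    "tokens": {
      "default": "replace-with-a-token-whose-policy-grants-service-write"
    }
  }
}
```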
@pierrsouchay I agree. I think some clear documentation surrounding what each token gets used for is probably the way to go.
It seems to me that using acl.tokens.default isn't as good as using either the acl.tokens.agent or perhaps a new configured token type.
If I'm understanding the behavior correctly, there is currently no way to specify a token that the agent uses to register services defined in static config without also leaking that token's attached ACL policies to everything that can talk to the local Consul agent. OTOH, if I don't set the default token, then I must generate some token that has to be written out into the static service definition files, which now must be treated differently because they contain privileged data.
If, on the other hand, acl.tokens.agent worked as @dmarkwat (and I) had expected, then an operator could set the file permissions of the main Consul configuration so that it is readable only by the Consul agent's user AND have a drop-dir with static service configurations that could be globally readable, keeping the privileged token out of those files entirely.
So this still seems like an enhancement request to me: either apply acl.tokens.agent to registrations made by the agent on startup from files, OR introduce something like acl.tokens.agent_services to apply in the same situation.
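To illustrate the proposal (note: acl.tokens.agent_services is hypothetical, proposed in this thread, and not an actual Consul option), the config might look like:

```json
{
  "acl": {
    "tokens": {
      "agent": "token-in-the-main-config-readable-only-by-the-consul-user",
      "agent_services": "token-used-only-for-config-file-service-registrations"
    }
  }
}
```

The point of the split is that neither token would double as the fallback for anonymous HTTP API requests the way acl.tokens.default does.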
Actually services configured via the config files already use the acl.tokens.default token.
I'm having trouble playing with the ACL system in 1.4.2 and not seeing the same result. I have a server setup with the following config:
{
"acl": {
"enabled": true,
"default_policy": "allow",
"down_policy": "allow",
"tokens": {
"agent": "c8fc2f6b-0fd5-93ff-cfd5-1924ca080afb"
}
}
}
That ACL token is defined as follows with the given policy:
vagrant@n1:~$ consul acl token read -id 86ed6770-5d80-6f29-ffb2-c0cc553184b9
AccessorID: 86ed6770-5d80-6f29-ffb2-c0cc553184b9
SecretID: c8fc2f6b-0fd5-93ff-cfd5-1924ca080afb
Description: Agent Token
Local: false
Create Time: 2019-02-24 21:40:06.506618598 +0000 UTC
Policies:
1e9d2fd5-3ac2-0899-45ae-41853d114eef - agent-token
vagrant@n1:~$ consul acl policy read -id 1e9d2fd5-3ac2-0899-45ae-41853d114eef
ID: 1e9d2fd5-3ac2-0899-45ae-41853d114eef
Name: agent-token
Description: Agent Token Policy
Datacenters:
Rules:
node_prefix "" {
policy = "write"
}
service_prefix "" {
policy = "read"
}
These were created following the bootstrap guide. I set up a client with the following config:
vagrant@n2:~$ cat /etc/consul.d/acl.json
{
"acl": {
"enabled": true,
"default_policy": "allow",
"down_policy": "allow",
"tokens": {
"agent": "c8fc2f6b-0fd5-93ff-cfd5-1924ca080afb",
"default": "c8fc2f6b-0fd5-93ff-cfd5-1924ca080afb"
}
},
"acl_agent_token": "c8fc2f6b-0fd5-93ff-cfd5-1924ca080afb",
"acl_token": "c8fc2f6b-0fd5-93ff-cfd5-1924ca080afb"
}
vagrant@n2:~$ cat /etc/consul.d/web.json
{
"service": {
"name": "web3",
"tags": ["rails"],
"port": 80,
"token": "c8fc2f6b-0fd5-93ff-cfd5-1924ca080afb"
}
}
acl_agent_token and acl_token don't even seem appropriate in the acl.json, but I tossed them in to try as well. When I try to join this client to the server and register this service, I see the following:
vagrant@n2:~$ consul agent -data-dir=/tmp/consul -node=agent-three -bind=172.20.20.11 -enable-script-checks=true -config-dir=/etc/consul.d -join 172.20.20.10 -log-level=trace
==> Starting Consul agent...
==> Joining cluster...
Join completed. Synced with 1 initial agents
==> Consul agent running!
Version: 'v1.4.2'
Node ID: 'a5e9b8cf-29e7-a21b-569a-a2e84722bd02'
Node name: 'agent-three'
Datacenter: 'dc1' (Segment: '')
Server: false (Bootstrap: false)
Client Addr: [127.0.0.1] (HTTP: 8500, HTTPS: -1, gRPC: -1, DNS: 8600)
Cluster Addr: 172.20.20.11 (LAN: 8301, WAN: 8302)
Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false
==> Log data will now stream in as it occurs:
2019/02/24 22:14:41 [INFO] serf: EventMemberJoin: agent-three 172.20.20.11
2019/02/24 22:14:41 [DEBUG] agent/proxy: managed Connect proxy manager started
2019/02/24 22:14:41 [INFO] agent: Started DNS server 127.0.0.1:8600 (tcp)
2019/02/24 22:14:41 [INFO] agent: Started DNS server 127.0.0.1:8600 (udp)
2019/02/24 22:14:41 [INFO] agent: Started HTTP server on 127.0.0.1:8500 (tcp)
2019/02/24 22:14:41 [INFO] agent: (LAN) joining: [172.20.20.10]
2019/02/24 22:14:41 [DEBUG] memberlist: Initiating push/pull sync with: 172.20.20.10:8301
2019/02/24 22:14:41 [INFO] serf: EventMemberJoin: agent-one 172.20.20.10
2019/02/24 22:14:41 [DEBUG] serf: Refuting an older leave intent
2019/02/24 22:14:41 [INFO] agent: (LAN) joined: 1 Err: <nil>
2019/02/24 22:14:41 [DEBUG] agent: systemd notify failed: No socket
2019/02/24 22:14:41 [INFO] consul: adding server agent-one (Addr: tcp/172.20.20.10:8300) (DC: dc1)
2019/02/24 22:14:41 [INFO] agent: started state syncer
2019/02/24 22:14:41 [ERR] consul: "Catalog.Register" RPC failed to server 172.20.20.10:8300: rpc error making call: Permission denied
2019/02/24 22:14:41 [WARN] agent: Service "web3" registration blocked by ACLs
2019/02/24 22:14:41 [INFO] agent: Synced node info
2019/02/24 22:14:41 [DEBUG] agent: Service "web3" in sync
2019/02/24 22:14:41 [DEBUG] agent: Node info in sync
2019/02/24 22:14:41 [DEBUG] acl: transition out of legacy ACL mode
2019/02/24 22:14:41 [INFO] serf: EventMemberUpdate: agent-three
2019/02/24 22:14:41 [DEBUG] serf: messageJoinType: agent-three
2019/02/24 22:14:41 [DEBUG] serf: messageJoinType: agent-three
2019/02/24 22:14:41 [DEBUG] serf: messageJoinType: agent-three
2019/02/24 22:14:41 [DEBUG] agent: Skipping remote check "serfHealth" since it is managed automatically
2019/02/24 22:14:41 [ERR] consul: "Catalog.Register" RPC failed to server 172.20.20.10:8300: rpc error making call: Permission denied
2019/02/24 22:14:41 [WARN] agent: Service "web3" registration blocked by ACLs
2019/02/24 22:14:41 [DEBUG] agent: Node info in sync
2019/02/24 22:14:42 [DEBUG] serf: messageJoinType: agent-three
Note the contradictory warn and debug lines:
2019/02/24 22:14:41 [WARN] agent: Service "web3" registration blocked by ACLs
2019/02/24 22:14:41 [DEBUG] agent: Service "web3" in sync
So, this web3 service is not registered, despite the service entry having a token definition and the acl block having both agent and default token definitions. Note also that default_policy is allow on both agents.
What sort of extra info can I provide to help debug this? The documentation is not very clear about how to proceed.
In your example you granted the token a policy that only allows read on all services:
service_prefix "" {
policy = "read"
}
If you intend to use that to register into the catalog it will need write on the service names that you are registering.
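For example, a corrected version of that policy might grant write instead (the prefix rules mirror the ones shown above; scoping to the specific service name would be tighter still):

```hcl
node_prefix "" {
  policy = "write"
}

# "write" (not "read") is required to register services into the catalog.
# A narrower alternative would be: service "web3" { policy = "write" }
service_prefix "" {
  policy = "write"
}
```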
@rboyer yes, I realized that about half an hour after posting here. My apologies :)
FWIW, I was able to duplicate the behavior reported above (https://github.com/hashicorp/consul/issues/4478#issuecomment-454138085): acl.tokens.agent is not used, which seems incorrect. Now anyone that can contact an agent via the HTTP API can essentially coerce the agent to perform any action, rather than the agent using the token presented by the client.
This kinda sucks... I just blew an hour trying to figure out why this wasn't working, only to hit the issue where it was using the default policy when trying to create the service.
Maybe another blog post could be written on ACL best practices with examples. There is now a blog post about the (new) ACL system since 1.4, and the learning guide explains the general mechanics. What's missing is an explanation of a consistent setup that is secure.
Hi @bbaassssiiee, I am currently working on a new guide that will include best practices for creating ACL policies with examples. Are there any security concerns you'd like to have addressed?
@kaitlincarter-hc the biggest concern we have is the requirement to use default if you can't provide a per-service token, which then basically allows for the whole ACL system to be circumvented by making a request directly to a more-privileged agent. BTW I mentioned this in my talk on ACLs at hashiconf, would be happy to chat offline in detail about it.
Consul ACL is very generic, once the default is set to deny we can create agents with a policy & token each. But we need to figure out what needs to be in place to be able to register services in a way that fits a security model. I get the impression that there are multiple models possible. I am looking for a way to use ACL by distinct teams that need to interface services in various permutations. So limited use of master tokens, and a way to delegate service registration.
Since this is about documentation, let me just chime in that the docs mention specifying "tokens" for configuration and commands in many places, and I have not found any clarification on whether the AccessorID, the SecretID, or a pair of the two is to be specified. It seems obvious that policies are configured with the AccessorID and those who want access will use the SecretID.
I still haven't figured out how to correctly declare tokens in config.json on agents. That they're both UUIDs of the same form doesn't help.
Hot off the press is a blog post https://www.hashicorp.com/blog/security-guides-for-consul/
@bbaassssiiee Great stuff above! However, it only talks about policies, which were already pretty obvious. What's still unclear is what to supply when configuring agent tokens. Let's see the various docs I'm able to find:
Usage: consul acl set-agent-token [options] TYPE TOKEN
$ consul acl set-agent-token agent "<agent token here>"
Token (string: "") - Specifies the ACL token to set
The only explicit example is the enterprise-only parameter managed_service_provider, which specifies both together:
"managed_service_provider": [
{
"accessor_id": "ed22003b-0832-4e48-ac65-31de64e5c2ff",
"secret_id": "cb6be010-bba8-4f30-a9ed-d347128dde17"
}
]
However, it seems that this is specific to managed_service_provider as the APIs for the rest only show a single UUID.
Nothing here either.
Nope.
Finally, here, we can deduce by matching the sample values that the SecretID is what goes under agent, and assume that the same goes for the rest.
AccessorID: 499ab022-27f2-acb8-4e05-5a01fff3b1d1
SecretID: da666809-98ca-0e94-a99c-893c4bf5f9eb
"tokens": {
"agent": "da666809-98ca-0e94-a99c-893c4bf5f9eb"
}
So after this, my conclusion is that when docs talk specifying "tokens", this means:
So after this, my conclusion is that when the docs talk about specifying "tokens", this means:
- AccessorID in ACL policies
- SecretID in static node configuration files and the set-agent-token command
- managed_service_provider, which expects a dictionary of the form {"accessor_id": "", "secret_id": ""}
I'm still not 100% on the above, and if you have problems bootstrapping the ACL system it's hard to know whether it's because of a misconfigured agent token.
Unless I missed something glaringly obvious, it's only the source code left at this point. (I still haven't resolved all issues with bootstrapping the ACL system; that's what set off the digging above, to rule out a misconfigured agent token. It's quite confusing to newcomers that docs, guides, and examples use "token" to refer to AccessorIDs or SecretIDs interchangeably, with the distinction above only ever implied.)
If it helps: I created an ansible-role to install consul policies https://github.com/dockpack/consul-policies/tree/master/templates
My use case is to set up ACLs that authorize DNS requests but do not authorize access to the service and host information, even read-only. Currently that's not possible. I need the DNS feature, but we have a security scan tool which flags an issue when the Consul web UI is not protected by ACLs (even read-only).
You could disable the web interface on 8500/8501 and firewall the DNS on 8600.