Service registration fails from any agent when ACLs are configured according to the Consul ACL docs. The docs read as though acl_agent_token is the token the agent uses for all agent-local sync operations; I've taken this to include things like service registration from .json files in, say, the -config-dir directory. However, even with the following policy attached to the token configured as acl_agent_token, service registration fails:
node "" {
policy = "write"
}
service "" {
policy = "write"
}
It's only when I set the token field in my service definition or if I set the acl_token field in the agent config that service registration succeeds. I don't see this called out anywhere in the ACL guide or service documentation, and based on how the acl_agent_token description reads, I would assume that local agent configurations (including static service definitions in config json files) should be using the acl_agent_token.
I would hope to find out if this is expected behavior for acl_agent_token or if the acl_agent_token should indeed be used in the registration of static service configs.
Steps to reproduce this issue, using the following agent config:
"acl_datacenter": "dc1",
"acl_down_policy": "extend-cache",
"acl_default_policy": "deny",
"acl_master_token": "mytoken",
"acl_agent_token": "myagenttoken",
"service": {
"name": "myservice",
"port": 9999
}
consul agent -bootstrap -data-dir=./data -config-dir=.
curl http://localhost:8500/v1/acl/update?token=mytoken -X PUT --data '{"ID": "myagenttoken", "Name": "Server token", "Type": "client", "Rules": "node \"\" { policy = \"write\" } service \"\" { policy = \"write\" }"}'
Result:
[WARN] agent: Service "myservice" registration blocked by ACLs
With "acl_token": "myagenttoken" in the agent config:
[INFO] agent: Synced service "myservice"
With "token": "myagenttoken" in the service config:
[INFO] agent: Synced service "myservice"
Client and Server info (they are the same for the purposes of this test)
agent:
check_monitors = 0
check_ttls = 0
checks = 0
services = 1
build:
prerelease =
revision = e716d1b5
version = 1.2.2
consul:
bootstrap = true
known_datacenters = 1
leader = true
leader_addr = scrubbed
server = true
raft:
applied_index = 15
commit_index = 15
fsm_pending = 0
last_contact = 0
last_log_index = 15
last_log_term = 4
last_snapshot_index = 0
last_snapshot_term = 0
latest_configuration = [{Suffrage:Voter ID:scrubbed Address:scrubbed}]
latest_configuration_index = 1
num_peers = 0
protocol_version = 3
protocol_version_max = 3
protocol_version_min = 0
snapshot_version_max = 1
snapshot_version_min = 0
state = Leader
term = 4
runtime:
arch = amd64
cpu_count = 2
goroutines = 75
max_procs = 2
os = linux
version = go1.10.1
serf_lan:
coordinate_resets = 0
encrypted = true
event_queue = 1
event_time = 4
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 1
members = 1
query_queue = 0
query_time = 1
serf_wan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 1
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 1
members = 1
query_queue = 0
query_time = 1
CentOS/RHEL 6 & 7 machines are affected in my testing. Other OSes weren't tested.
Please let me know if you would like logs for any or all of the above cases. They're not included here since the issue should be reproducible (I've run through it several times myself now).
Hey @dmarkwat thanks for the issue.
We certainly have a lot of complicated options around ACL tokens, and I think the current docs are correct but perhaps not super clear depending on how you read them.
What you describe is the intended behaviour.
From the docs (https://www.consul.io/docs/agent/options.html#acl_agent_token):
acl_agent_token - Used for clients and servers to perform internal operations. If this isn't specified, then the acl_token will be used. This was added in Consul 0.7.2.
This token must at least have write access to the node name it will register as in order to set any of the node-level information in the catalog such as metadata, or the node's tagged addresses. There are other places this token is used, please see ACL Agent Token for more details.
To put that another way, this is _only_ used for internal agent operations that are not explicitly initiated by an operator configuration or API call. In practice that means registering the _node_ i.e. having the agent join the cluster in the first place and maybe some other internal things like managing encryption key operations to make Gossip work (I might have made that up but it's a reasonable thing we might use this scoped token for).
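As a sketch of what that scoping implies (the node name "agent-one" is just a placeholder, not from this thread), a minimally scoped policy for acl_agent_token might look like:

```hcl
# Hypothetical minimal policy for acl_agent_token: just enough for the
# agent to register its own node entry in the catalog, nothing more.
# "agent-one" is a placeholder node name.
node "agent-one" {
  policy = "write"
}
```

A broad node "" { policy = "write" } rule like the one at the top of this issue also satisfies the node-registration requirement, just with a wider scope.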
acl_token is actually what you want in this case. That is described as (https://www.consul.io/docs/agent/options.html#acl_token):
acl_token - When provided, the agent will use this token when making requests to the Consul servers. Clients can override this token on a per-request basis by providing the "?token" query parameter. When not provided, the empty token, which maps to the 'anonymous' ACL policy, is used.
So in other words, a _service_ registration (whether it is initiated via API call or config file) will either need an explicit token in the API request or in the config file itself as you describe, otherwise it will default to using acl_token.
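Putting that together, a sketch of the two options, reusing the config from the report above (either one alone is enough; the token value is a placeholder):

```json
{
  "acl_token": "myagenttoken",

  "service": {
    "name": "myservice",
    "port": 9999,
    "token": "myagenttoken"
  }
}
```

The per-service "token" takes precedence for that service's registration; "acl_token" is the agent-wide fallback for requests that carry no token of their own.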
In general it's a known issue that the ACL docs are not easy to work with. We have a bunch of changes to ACLs on the cards that should help, which is why we are holding off a little on rewriting the docs immediately.
That said, if you can point out which specific wording confused you here that would be really useful data for us during that process.
Thanks for clearing that up! It occurred to me after I posted that, "oh, maybe they mean agent-_only_, not agent-_wide_". So that was my bad! Everything is working for me now, exactly as you described. As for where I got tripped up, I'd say it stems in large part from this passage in the guide.
The acl_agent_token is a special token that is used for an agent's internal operations. It isn't used directly for any user-initiated operations like the acl_token, though if the acl_agent_token isn't configured the acl_token will be used.
I may have taken that too literally and assumed that everything not user-initiated (from the command line/API) would be handled with the acl_agent_token. No other doc led me to believe that it _didn't_ cover various static local configs, such as service json files, and that may have been where it went off the rails for me. But at the very least, there's a clear issue out here now which would have helped me!
Also sorry for hitting close prematurely--meant to hit preview (oops).
@banks So Consul's ACL model regards service definitions in static config files as registrations by random strangers via HTTP API? That sounds very wrong to me, as wrong as requiring -config-file to be world-writable.
@nodakai This seemed wrong to me as well so I looked into the code and that is almost accurate. You could have a service def in the config file like:
"service": {
"name": "myservice",
"port": 9999,
"token": "072c2f14-7499-4d1f-8dec-7f2cae00f0e3"
}
This embeds the proper token for registering the service, so it won't appear as if a random stranger is attempting to register it.
With that being said, maybe for service definitions from the config file we should use the agent token or at the very least the default token.
I am reopening this to track an enhancement that will use the acl.tokens.default token value for service registrations originating from config files.
I think it really needs some clarification, see our PR which is a bit related : https://github.com/hashicorp/consul/pull/5217
Actually, services configured via the config files already use the acl.tokens.default token, which is the same thing that would happen if the default token were configured and a request to register a service came in with no token.
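In 1.4+ config syntax, that fallback would look something like the following sketch, where the default token's attached policy needs service write access for the services loaded from config files (the token value is a placeholder):

```json
{
  "acl": {
    "enabled": true,
    "tokens": {
      "default": "replace-with-a-token-whose-policy-grants-service-write"
    }
  }
}
```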
@pierrsouchay I agree. I think some clear documentation surrounding what each token gets used for is probably the way to go.
It seems to me that using acl.tokens.default isn't as good as using either the acl.tokens.agent or perhaps a new configured token type.
If I'm understanding the behavior correctly, there is currently no way to specify a token that the agent uses to register services defined in static config without also leaking that token's attached ACL policies to everything that can talk to the local Consul agent. OTOH, if I don't set the default token, then I must generate some token that has to be written out into the static service definition files, which now must be treated differently because they contain privileged data.
If, on the other hand, acl.tokens.agent worked as @dmarkwat (and I) had expected, then an operator could set the file permissions of the main Consul configuration so that it is readable only by the Consul agent's user AND have a drop-dir with static service configurations that could be globally readable, keeping the privileged token out of those files entirely.
So this still seems like an enhancement request to me: either apply acl.tokens.agent to registrations made by the agent on startup from files, OR introduce something like acl.tokens.agent_services to apply in the same situation.
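To illustrate the proposal (note: acl.tokens.agent_services is hypothetical, proposed in this thread, and not an actual Consul option), the config might look like:

```json
{
  "acl": {
    "tokens": {
      "agent": "token-in-the-main-config-readable-only-by-the-consul-user",
      "agent_services": "token-used-only-for-config-file-service-registrations"
    }
  }
}
```

The point of the split is that neither token would double as the fallback for anonymous HTTP API requests the way acl.tokens.default does.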
Actually services configured via the config files already use the acl.tokens.default token.
I'm having trouble playing with the ACL system in 1.4.2 and not seeing the same result. I have a server setup with the following config:
{
"acl": {
"enabled": true,
"default_policy": "allow",
"down_policy": "allow",
"tokens": {
"agent": "c8fc2f6b-0fd5-93ff-cfd5-1924ca080afb"
}
}
}
That ACL token is defined as follows with the given policy:
vagrant@n1:~$ consul acl token read -id 86ed6770-5d80-6f29-ffb2-c0cc553184b9
AccessorID: 86ed6770-5d80-6f29-ffb2-c0cc553184b9
SecretID: c8fc2f6b-0fd5-93ff-cfd5-1924ca080afb
Description: Agent Token
Local: false
Create Time: 2019-02-24 21:40:06.506618598 +0000 UTC
Policies:
1e9d2fd5-3ac2-0899-45ae-41853d114eef - agent-token
vagrant@n1:~$ consul acl policy read -id 1e9d2fd5-3ac2-0899-45ae-41853d114eef
ID: 1e9d2fd5-3ac2-0899-45ae-41853d114eef
Name: agent-token
Description: Agent Token Policy
Datacenters:
Rules:
node_prefix "" {
policy = "write"
}
service_prefix "" {
policy = "read"
}
These were created following the bootstrap guide. I set up a client with the following config:
vagrant@n2:~$ cat /etc/consul.d/acl.json
{
"acl": {
"enabled": true,
"default_policy": "allow",
"down_policy": "allow",
"tokens": {
"agent": "c8fc2f6b-0fd5-93ff-cfd5-1924ca080afb",
"default": "c8fc2f6b-0fd5-93ff-cfd5-1924ca080afb"
}
},
"acl_agent_token": "c8fc2f6b-0fd5-93ff-cfd5-1924ca080afb",
"acl_token": "c8fc2f6b-0fd5-93ff-cfd5-1924ca080afb"
}
vagrant@n2:~$ cat /etc/consul.d/web.json
{
"service": {
"name": "web3",
"tags": ["rails"],
"port": 80,
"token": "c8fc2f6b-0fd5-93ff-cfd5-1924ca080afb"
}
}
acl_agent_token and acl_token don't even seem appropriate in the acl.json, but I tossed them in to try as well. When I try to join this client to the server and register this service, I see the following:
vagrant@n2:~$ consul agent -data-dir=/tmp/consul -node=agent-three -bind=172.20.20.11 -enable-script-checks=true -config-dir=/etc/consul.d -join 172.20.20.10 -log-level=trace
==> Starting Consul agent...
==> Joining cluster...
Join completed. Synced with 1 initial agents
==> Consul agent running!
Version: 'v1.4.2'
Node ID: 'a5e9b8cf-29e7-a21b-569a-a2e84722bd02'
Node name: 'agent-three'
Datacenter: 'dc1' (Segment: '')
Server: false (Bootstrap: false)
Client Addr: [127.0.0.1] (HTTP: 8500, HTTPS: -1, gRPC: -1, DNS: 8600)
Cluster Addr: 172.20.20.11 (LAN: 8301, WAN: 8302)
Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false
==> Log data will now stream in as it occurs:
2019/02/24 22:14:41 [INFO] serf: EventMemberJoin: agent-three 172.20.20.11
2019/02/24 22:14:41 [DEBUG] agent/proxy: managed Connect proxy manager started
2019/02/24 22:14:41 [INFO] agent: Started DNS server 127.0.0.1:8600 (tcp)
2019/02/24 22:14:41 [INFO] agent: Started DNS server 127.0.0.1:8600 (udp)
2019/02/24 22:14:41 [INFO] agent: Started HTTP server on 127.0.0.1:8500 (tcp)
2019/02/24 22:14:41 [INFO] agent: (LAN) joining: [172.20.20.10]
2019/02/24 22:14:41 [DEBUG] memberlist: Initiating push/pull sync with: 172.20.20.10:8301
2019/02/24 22:14:41 [INFO] serf: EventMemberJoin: agent-one 172.20.20.10
2019/02/24 22:14:41 [DEBUG] serf: Refuting an older leave intent
2019/02/24 22:14:41 [INFO] agent: (LAN) joined: 1 Err: <nil>
2019/02/24 22:14:41 [DEBUG] agent: systemd notify failed: No socket
2019/02/24 22:14:41 [INFO] consul: adding server agent-one (Addr: tcp/172.20.20.10:8300) (DC: dc1)
2019/02/24 22:14:41 [INFO] agent: started state syncer
2019/02/24 22:14:41 [ERR] consul: "Catalog.Register" RPC failed to server 172.20.20.10:8300: rpc error making call: Permission denied
2019/02/24 22:14:41 [WARN] agent: Service "web3" registration blocked by ACLs
2019/02/24 22:14:41 [INFO] agent: Synced node info
2019/02/24 22:14:41 [DEBUG] agent: Service "web3" in sync
2019/02/24 22:14:41 [DEBUG] agent: Node info in sync
2019/02/24 22:14:41 [DEBUG] acl: transition out of legacy ACL mode
2019/02/24 22:14:41 [INFO] serf: EventMemberUpdate: agent-three
2019/02/24 22:14:41 [DEBUG] serf: messageJoinType: agent-three
2019/02/24 22:14:41 [DEBUG] serf: messageJoinType: agent-three
2019/02/24 22:14:41 [DEBUG] serf: messageJoinType: agent-three
2019/02/24 22:14:41 [DEBUG] agent: Skipping remote check "serfHealth" since it is managed automatically
2019/02/24 22:14:41 [ERR] consul: "Catalog.Register" RPC failed to server 172.20.20.10:8300: rpc error making call: Permission denied
2019/02/24 22:14:41 [WARN] agent: Service "web3" registration blocked by ACLs
2019/02/24 22:14:41 [DEBUG] agent: Node info in sync
2019/02/24 22:14:42 [DEBUG] serf: messageJoinType: agent-three
Note the contradictory warn and debug lines:
2019/02/24 22:14:41 [WARN] agent: Service "web3" registration blocked by ACLs
2019/02/24 22:14:41 [DEBUG] agent: Service "web3" in sync
So, this web3 service is not registered, despite the service entry having a token definition and the acl block having both agent and default token definitions. Note also that default_policy is allow on both agents.
What sort of extra info can I provide to help debug this? The documentation is not very clear about how to proceed.
In your example you granted the token a policy that only allows read on all services:
service_prefix "" {
policy = "read"
}
If you intend to use that to register into the catalog it will need write on the service names that you are registering.
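For example, a corrected version of that policy might grant write instead (the prefix rules mirror the ones shown above; scoping to the specific service name would be tighter still):

```hcl
node_prefix "" {
  policy = "write"
}

# "write" (not "read") is required to register services into the catalog.
# A narrower alternative would be: service "web3" { policy = "write" }
service_prefix "" {
  policy = "write"
}
```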
@rboyer yes, I realized that about half an hour after posting here. My apologies :)
FWIW, I was able to duplicate the behavior reported above (https://github.com/hashicorp/consul/issues/4478#issuecomment-454138085): acl.tokens.agent is not used, which seems incorrect. Now anyone that can contact an agent via the HTTP API can essentially coerce the agent to perform any action, rather than the agent using the token presented by the client.
This kinda sucks... I just blew an hour trying to figure out why this wasn't working, only to hit the issue where it was using the default policy when trying to create the service.
Maybe another blog post could be written on ACL best practices with examples. There is now a blog post about the (new) ACL system since 1.4, and the learning guide explains the general mechanics. What's missing is an explanation of a consistent setup that is secure.
Hi @bbaassssiiee, I am currently working on a new guide that will include best practices for creating ACL policies with examples. Are there any security concerns you'd like to have addressed?
@kaitlincarter-hc the biggest concern we have is the requirement to use default if you can't provide a per-service token, which then basically allows for the whole ACL system to be circumvented by making a request directly to a more-privileged agent. BTW I mentioned this in my talk on ACLs at hashiconf, would be happy to chat offline in detail about it.
Consul ACL is very generic, once the default is set to deny we can create agents with a policy & token each. But we need to figure out what needs to be in place to be able to register services in a way that fits a security model. I get the impression that there are multiple models possible. I am looking for a way to use ACL by distinct teams that need to interface services in various permutations. So limited use of master tokens, and a way to delegate service registration.
Since this is about documentation, let me just chime in that the docs mention specifying "tokens" for configuration and commands in many places, and I have not found any clarification on whether the AccessorID, the SecretID, or a pair of the two is to be specified. It seems obvious that policies are configured with the AccessorID and those who want access will use the SecretID.
I still haven't figured out how to correctly declare tokens in config.json on agents. That they're both UUIDs of the same form doesn't help.
Hot off the press is a blog post https://www.hashicorp.com/blog/security-guides-for-consul/
@bbaassssiiee Great stuff above! However, it only talks about policies, which were already pretty obvious. What's still unclear is what to supply when configuring agent tokens. Let's see the various docs I'm able to find:
Usage: consul acl set-agent-token [options] TYPE TOKEN
$ consul acl set-agent-token agent "<agent token here>"
Token (string: "") - Specifies the ACL token to set
The only explicit example is the enterprise-only parameter managed_service_provider, which specifies both together:
"managed_service_provider": [
{
"accessor_id": "ed22003b-0832-4e48-ac65-31de64e5c2ff",
"secret_id": "cb6be010-bba8-4f30-a9ed-d347128dde17"
}
]
However, it seems that this is specific to managed_service_provider as the APIs for the rest only show a single UUID.
Nothing here either.
Nope.
Finally, here, we can deduce by matching the sample values that the SecretID is what goes under agent, and assume that the same goes for the rest.
AccessorID: 499ab022-27f2-acb8-4e05-5a01fff3b1d1
SecretID: da666809-98ca-0e94-a99c-893c4bf5f9eb
"tokens": {
"agent": "da666809-98ca-0e94-a99c-893c4bf5f9eb"
}
So after this, my conclusion is that when docs talk specifying "tokens", this means:
So after this, my conclusion is that when the docs talk about specifying "tokens", this means:
- AccessorID in ACL policies
- SecretID in static node configuration files and the set-agent-token command
- managed_service_provider, which expects a dictionary of the form {"accessor_id": "", "secret_id": ""}
I'm still not 100% on the above, and if you have problems bootstrapping the ACL system it's hard to know whether it's because of a misconfigured agent token.
Unless I missed something glaringly obvious, it's only the source code left at this point. (I still haven't resolved all issues with bootstrapping the ACL system; that's what set off the digging above, to rule out a misconfigured agent token. It's quite confusing to newcomers that docs, guides, and examples use "token" to refer to AccessorIDs or SecretIDs interchangeably, with the distinction above only ever implied.)
If it helps: I created an ansible-role to install consul policies https://github.com/dockpack/consul-policies/tree/master/templates
My use case is to set up ACLs that authorize DNS requests but do not authorize access to the service and host information, even read-only. Currently that's not possible. I need the DNS feature, but we have a security scan tool which flags an issue when the Consul web UI is not protected by ACLs (even read-only).
You could disable the web interface on 8500/8501 and firewall the DNS on 8600.