Nomad v0.7.0
CoreOS 1492.1.0 with nomad running in alpine container
When starting a clean agent, Nomad often doesn't add the vault.* attributes, which are then not available for tasks that require Vault tokens. After restarting Nomad, the attributes get added.
None, but it feels like it happens in about 3 out of 5 cases.
Loaded configuration from /cluster_config/config/nomad/config.hcl, /agent.hcl
==> Starting Nomad agent...
==> Nomad agent configuration:
Client: true
Log Level: INFO
Region: global (DC: gc-europe-west1)
Server: false
Version: 0.7.0
==> Nomad agent started! Log data will stream in below:
2017/11/07 18:33:46.724548 [INFO] client: using state directory /var/lib/docker/volumes/nomad/_data/client
2017/11/07 18:33:46.726639 [INFO] client: using alloc directory /var/lib/docker/volumes/nomad/_data/alloc
2017/11/07 18:33:46.727810 [INFO] fingerprint.cgroups: cgroups are available
2017/11/07 18:33:46.733447 [INFO] fingerprint.consul: consul agent is available
2017/11/07 18:33:46.781415 [WARN] driver.raw_exec: raw exec is enabled. Only enable if needed
2017/11/07 18:33:46.782124 [INFO] client: Node ID "63302955-2775-7da1-6cf0-9c480e1cf438"
2017/11/07 18:33:46.784333 [ERR] client.consul: error discovering nomad servers: client.consul: unable to query Consul datacenters: Unexpected response code: 500 (No known Consul servers)
2017/11/07 18:33:49.600585 [ERR] http: Request /v1/agent/health?type=client, error: {"client":{"ok":false,"message":"no known servers"}}
2017/11/07 18:34:00.060668 [ERR] http: Request /v1/agent/health?type=client, error: {"client":{"ok":false,"message":"no known servers"}}
2017/11/07 18:34:10.063242 [ERR] http: Request /v1/agent/health?type=client, error: {"client":{"ok":false,"message":"no known servers"}}
2017/11/07 18:34:11.961070 [ERR] client.consul: error discovering nomad servers: client.consul: unable to query Consul datacenters: Unexpected response code: 500 (No known Consul servers)
2017/11/07 18:34:20.064665 [ERR] http: Request /v1/agent/health?type=client, error: {"client":{"ok":false,"message":"no known servers"}}
2017/11/07 18:34:27.911722 [ERR] client.consul: error discovering nomad servers: client.consul: unable to query Consul datacenters: Unexpected response code: 500 (No known Consul servers)
2017/11/07 18:34:30.067475 [ERR] http: Request /v1/agent/health?type=client, error: {"client":{"ok":false,"message":"no known servers"}}
2017/11/07 18:34:40.069866 [ERR] http: Request /v1/agent/health?type=client, error: {"client":{"ok":false,"message":"no known servers"}}
2017/11/07 18:34:50.072724 [ERR] http: Request /v1/agent/health?type=client, error: {"client":{"ok":false,"message":"no known servers"}}
2017/11/07 18:34:56.749201 [ERR] client.consul: error discovering nomad servers: client.consul: unable to query Consul datacenters: Unexpected response code: 500 (No known Consul servers)
2017/11/07 18:35:00.074834 [ERR] http: Request /v1/agent/health?type=client, error: {"client":{"ok":false,"message":"no known servers"}}
2017/11/07 18:35:02.267875 [INFO] fingerprint.vault: Vault is available
2017/11/07 18:35:10.076315 [ERR] http: Request /v1/agent/health?type=client, error: {"client":{"ok":false,"message":"no known servers"}}
2017/11/07 18:35:15.959869 [INFO] client.consul: discovered following Servers: 10.242.0.10:4647,10.242.0.11:4647,10.242.0.12:4647
2017/11/07 18:35:15.970857 [INFO] client: node registration complete
nomad node-status -verbose after registration
Attributes
consul.datacenter = gc-europe-west1
consul.revision = 51ea240
consul.server = false
consul.version = 1.0.0
cpu.arch = amd64
cpu.frequency = 2300
cpu.modelname = Intel(R) Xeon(R) CPU @ 2.30GHz
cpu.numcores = 4
cpu.totalcompute = 9200
docker.privileged.enabled = 1
driver.docker = 1
driver.docker.bridge_ip = 172.17.0.1
driver.docker.version = 17.05.0-ce
driver.docker.volumes.enabled = 1
driver.exec = 1
driver.raw_exec = 1
kernel.name = linux
kernel.version = 4.12.4-coreos-r1
memory.totalbytes = 15774470144
nomad.revision = <none>
nomad.version = 0.7.0
os.name = alpine
os.signals = SIGCONT,SIGINT,SIGPROF,SIGTTOU,SIGWINCH,SIGABRT,SIGIOT,SIGSEGV,SIGTRAP,SIGUSR1,SIGCHLD,SIGSTOP,SIGTERM,SIGXCPU,SIGIO,SIGURG,SIGXFSZ,SIGALRM,SIGBUS,SIGPIPE,SIGSYS,SIGTTIN,SIGFPE,SIGKILL,SIGILL,SIGTSTP,SIGUSR2,SIGHUP,SIGQUIT
os.version = 3.6.2
platform.gce.attr.nextjournal = AGENT_GROUP=shared
platform.gce.cpu-platform = Intel Haswell
platform.gce.machine-type = n1-standard-4
platform.gce.network.global-net-staging = true
platform.gce.scheduling.automatic-restart = TRUE
platform.gce.scheduling.on-host-maintenance = MIGRATE
platform.gce.tag.agent = true
platform.gce.tag.cluster-config = true
platform.gce.tag.consul-agent = true
platform.gce.tag.europe-west1-d = true
platform.gce.tag.no-ip = true
platform.gce.tag.nomad-agent = true
platform.gce.tag.shared-agent = true
platform.gce.tag.staging = true
platform.gce.zone = europe-west1-d
unique.cgroup.mountpoint = /sys/fs/cgroup
unique.consul.name = agent-2-staging.c.nextjournal-com.internal
unique.hostname = agent-2-staging.c.nextjournal-com.internal
unique.network.ip-address = 10.242.0.6
unique.platform.gce.hostname = agent-2-staging.c.nextjournal-com.internal
unique.platform.gce.id = 3971726229872853660
unique.platform.gce.network.global-net-staging.external-ip.0 = <none>
unique.platform.gce.network.global-net-staging.ip = 10.242.0.6
unique.storage.bytesfree = 986028699648
unique.storage.bytestotal = 1037829509120
unique.storage.volume = /dev/sda9
Meta
agent_group = shared
environment = staging
gpu_count = 0
runner_instance_id = <none>
Logs after restarting:
Loaded configuration from /cluster_config/config/nomad/config.hcl, /agent.hcl
==> Starting Nomad agent...
==> Nomad agent configuration:
Client: true
Log Level: INFO
Region: global (DC: gc-europe-west1)
Server: false
Version: 0.7.0
==> Nomad agent started! Log data will stream in below:
2017/11/07 18:39:10.509500 [INFO] client: using state directory /var/lib/docker/volumes/nomad/_data/client
2017/11/07 18:39:10.509632 [INFO] client: using alloc directory /var/lib/docker/volumes/nomad/_data/alloc
2017/11/07 18:39:10.510642 [INFO] fingerprint.cgroups: cgroups are available
2017/11/07 18:39:10.518293 [INFO] fingerprint.consul: consul agent is available
2017/11/07 18:39:10.526386 [INFO] fingerprint.vault: Vault is available
2017/11/07 18:39:10.549027 [WARN] driver.raw_exec: raw exec is enabled. Only enable if needed
2017/11/07 18:39:10.555850 [INFO] client.consul: discovered following Servers: 10.242.0.10:4647,10.242.0.11:4647,10.242.0.12:4647
2017/11/07 18:39:10.556586 [INFO] client: Node ID "63302955-2775-7da1-6cf0-9c480e1cf438"
2017/11/07 18:39:10.566308 [INFO] client: node registration complete
and nomad node-status -verbose after restart
Attributes
consul.datacenter = gc-europe-west1
consul.revision = 51ea240
consul.server = false
consul.version = 1.0.0
cpu.arch = amd64
cpu.frequency = 2300
cpu.modelname = Intel(R) Xeon(R) CPU @ 2.30GHz
cpu.numcores = 4
cpu.totalcompute = 9200
docker.privileged.enabled = 1
driver.docker = 1
driver.docker.bridge_ip = 172.17.0.1
driver.docker.version = 17.05.0-ce
driver.docker.volumes.enabled = 1
driver.exec = 1
driver.raw_exec = 1
kernel.name = linux
kernel.version = 4.12.4-coreos-r1
memory.totalbytes = 15774470144
nomad.revision = <none>
nomad.version = 0.7.0
os.name = alpine
os.signals = SIGSTOP,SIGTRAP,SIGCHLD,SIGHUP,SIGSEGV,SIGUSR1,SIGWINCH,SIGQUIT,SIGTTIN,SIGTTOU,SIGFPE,SIGINT,SIGPIPE,SIGSYS,SIGTSTP,SIGCONT,SIGILL,SIGTERM,SIGALRM,SIGKILL,SIGUSR2,SIGIOT,SIGURG,SIGABRT,SIGBUS,SIGIO,SIGPROF,SIGXCPU,SIGXFSZ
os.version = 3.6.2
platform.gce.attr.nextjournal = AGENT_GROUP=shared
platform.gce.cpu-platform = Intel Haswell
platform.gce.machine-type = n1-standard-4
platform.gce.network.global-net-staging = true
platform.gce.scheduling.automatic-restart = TRUE
platform.gce.scheduling.on-host-maintenance = MIGRATE
platform.gce.tag.agent = true
platform.gce.tag.cluster-config = true
platform.gce.tag.consul-agent = true
platform.gce.tag.europe-west1-d = true
platform.gce.tag.no-ip = true
platform.gce.tag.nomad-agent = true
platform.gce.tag.shared-agent = true
platform.gce.tag.staging = true
platform.gce.zone = europe-west1-d
unique.cgroup.mountpoint = /sys/fs/cgroup
unique.consul.name = agent-2-staging.c.nextjournal-com.internal
unique.hostname = agent-2-staging.c.nextjournal-com.internal
unique.network.ip-address = 10.242.0.6
unique.platform.gce.hostname = agent-2-staging.c.nextjournal-com.internal
unique.platform.gce.id = 3971726229872853660
unique.platform.gce.network.global-net-staging.external-ip.0 = <none>
unique.platform.gce.network.global-net-staging.ip = 10.242.0.6
unique.storage.bytesfree = 983862984704
unique.storage.bytestotal = 1037829509120
unique.storage.volume = /dev/sda9
vault.accessible = true
vault.cluster_id = 03a578ff-2020-9be8-6e1b-30c8027a37dd
vault.cluster_name = vault-cluster-031035f1
vault.version = 0.8.3
Meta
agent_group = shared
environment = staging
gpu_count = 0
runner_instance_id = <none>
Agent config:
server {
  enabled = false
}

leave_on_interrupt = true
leave_on_terminate = true

client {
  enabled = true

  options {
    "driver.raw_exec.enable"    = "1"
    "docker.cleanup.image"      = false
    "docker.privileged.enabled" = true
    "docker.volumes.enabled"    = true
  }

  meta {
    "environment"        = "staging"
    "gpu_count"          = "0"
    "agent_group"        = "shared"
    "runner_instance_id" = ""
  }
}

data_dir = "/var/lib/docker/volumes/nomad/_data"
log_level = "INFO"
disable_update_check = true
region = "global"

consul {
  address    = "consul.service.consul:8500"
  ssl        = true
  verify_ssl = true
  ca_file    = "/cluster_config/certs/consul/ca.cert"
  cert_file  = "/cluster_config/certs/consul/consul.cert"
  key_file   = "/cluster_config/certs/consul/consul.key"
}

vault {
  enabled = true
  address = "http://vault.service.consul:8200"
}

telemetry {
  publish_allocation_metrics = true
  publish_node_metrics       = true
  prometheus_metrics         = true
}
On the initial startup it appears Nomad and Consul are starting concurrently, because a few Consul errors are logged before it is finally able to connect and discover the servers.
Is it possible this happens on system boot, while Nomad, Consul, and Vault are starting concurrently?
Yes, that's on system boot and they start concurrently, but just Nomad and Consul since it's a client agent.
Some more logs, with Consul included:
```
==> Found address '10.242.0.6' for interface 'ens4v1', setting bind option...
==> Found address '10.242.0.6' for interface 'ens4v1', setting client option...
==> Starting Consul agent...
==> Consul agent running!
Version: 'v1.0.0'
Node ID: '8b61c0fb-175a-8c02-8bbb-3e498d778dfb'
Node name: 'agent-2-staging.c.nextjournal-com.internal'
Datacenter: 'gc-europe-west1' (Segment: '')
Server: false (Bootstrap: false)
Client Addr: [10.242.0.6] (HTTP: 8501, HTTPS: 8500, DNS: 8600)
Cluster Addr: 10.242.0.6 (LAN: 8301, WAN: 8302)
Encrypt: Gossip: true, TLS-Outgoing: true, TLS-Incoming: true
==> Log data will now stream in as it occurs:
2017/11/09 11:03:31 [INFO] serf: EventMemberJoin: agent-2-staging.c.nextjournal-com.internal 10.242.0.6
2017/11/09 11:03:31 [INFO] agent: Started DNS server 10.242.0.6:8600 (udp)
2017/11/09 11:03:31 [INFO] agent: Started DNS server 10.242.0.6:8600 (tcp)
2017/11/09 11:03:31 [INFO] agent: Started HTTP server on 127.0.0.1:8501 (tcp)
2017/11/09 11:03:31 [INFO] agent: Started HTTPS server on 10.242.0.6:8500 (tcp)
2017/11/09 11:03:31 [INFO] agent: Retry join LAN is supported for: aws azure gce softlayer
2017/11/09 11:03:31 [INFO] agent: Joining LAN cluster...
2017/11/09 11:03:31 [INFO] discover-gce: Project name is "nextjournal-com"
2017/11/09 11:03:31 [INFO] discover-gce: Loading credentials from /etc/journal-gc.json
2017/11/09 11:03:31 [WARN] manager: No servers available
2017/11/09 11:03:31 [ERR] agent: failed to sync remote state: No known Consul servers
2017/11/09 11:03:31 [INFO] discover-gce: Looking up zones matching europe-west1.+
2017/11/09 11:03:32 [WARN] manager: No servers available
2017/11/09 11:03:32 [ERR] http: Request GET /v1/status/leader, error: No known Consul servers from=10.242.0.6:44984
2017/11/09 11:03:32 [WARN] manager: No servers available
2017/11/09 11:03:32 [ERR] dns: rpc error: No known Consul servers
2017/11/09 11:03:32 [WARN] manager: No servers available
2017/11/09 11:03:32 [ERR] dns: rpc error: No known Consul servers
2017/11/09 11:03:32 [WARN] manager: No servers available
2017/11/09 11:03:32 [ERR] dns: rpc error: No known Consul servers
2017/11/09 11:03:32 [WARN] manager: No servers available
2017/11/09 11:03:32 [ERR] dns: rpc error: No known Consul servers
2017/11/09 11:03:32 [WARN] manager: No servers available
2017/11/09 11:03:32 [ERR] http: Request GET /v1/status/leader, error: No known Consul servers from=10.242.0.6:44990
2017/11/09 11:03:32 [WARN] manager: No servers available
2017/11/09 11:03:32 [ERR] http: Request GET /v1/catalog/datacenters, error: No known Consul servers from=10.242.0.6:44988
2017/11/09 11:03:32 [WARN] manager: No servers available
2017/11/09 11:03:32 [ERR] agent: failed to sync changes: No known Consul servers
2017/11/09 11:03:32 [WARN] manager: No servers available
2017/11/09 11:03:32 [ERR] agent: failed to sync changes: No known Consul servers
2017/11/09 11:03:32 [INFO] discover-gce: Found zones [europe-west1-d europe-west1-b europe-west1-c]
2017/11/09 11:03:32 [WARN] manager: No servers available
2017/11/09 11:03:32 [ERR] http: Request GET /v1/status/leader, error: No known Consul servers from=10.242.0.6:45020
2017/11/09 11:03:33 [INFO] discover-gce: Zone "europe-west1-d" has [10.242.0.5 10.242.0.6 10.242.0.13 10.242.0.3 10.242.0.11 10.242.0.8 10.242.0.4 10.2
2017/11/09 11:03:33 [INFO] discover-gce: Zone "europe-west1-b" has []
2017/11/09 11:03:33 [WARN] manager: No servers available
2017/11/09 11:03:33 [ERR] http: Request GET /v1/status/leader, error: No known Consul servers from=10.242.0.6:45024
2017/11/09 11:03:33 [WARN] agent: Check '83133417dfb9c40c36f30ce4006ded28a45b0e60' is now critical
2017/11/09 11:03:34 [WARN] manager: No servers available
2017/11/09 11:03:34 [ERR] http: Request GET /v1/status/leader, error: No known Consul servers from=10.242.0.6:45028
2017/11/09 11:03:34 [INFO] discover-gce: Zone "europe-west1-c" has [10.242.0.7 10.242.0.9 10.242.0.10 10.242.0.12]
2017/11/09 11:03:34 [INFO] agent: Discovered LAN servers: 10.242.0.5 10.242.0.6 10.242.0.13 10.242.0.3 10.242.0.11 10.242.0.8 10.242.0.4 10.242.0.2 10.
2017/11/09 11:03:34 [INFO] agent: (LAN) joining: [10.242.0.5 10.242.0.6 10.242.0.13 10.242.0.3 10.242.0.11 10.242.0.8 10.242.0.4 10.242.0.2 10.242.0.7
2017/11/09 11:03:34 [INFO] serf: EventMemberJoin: master-b3-staging.c.nextjournal-com.internal 10.242.0.12
2017/11/09 11:03:34 [INFO] serf: EventMemberJoin: loadbalancer-2-staging.c.nextjournal-com.internal 10.242.0.3
2017/11/09 11:03:34 [INFO] serf: EventMemberJoin: loadbalancer-1-staging.c.nextjournal-com.internal 10.242.0.9
2017/11/09 11:03:34 [INFO] serf: EventMemberJoin: build-staging.c.nextjournal-com.internal 10.242.0.7
2017/11/09 11:03:34 [INFO] consul: adding server master-b3-staging.c.nextjournal-com.internal (Addr: tcp/10.242.0.12:8300) (DC: gc-europe-west1)
2017/11/09 11:03:34 [INFO] serf: EventMemberJoin: master-b2-staging.c.nextjournal-com.internal 10.242.0.11
2017/11/09 11:03:34 [INFO] serf: EventMemberJoin: postgres-1-staging.c.nextjournal-com.internal 10.242.0.4
2017/11/09 11:03:34 [INFO] consul: adding server master-b2-staging.c.nextjournal-com.internal (Addr: tcp/10.242.0.11:8300) (DC: gc-europe-west1)
2017/11/09 11:03:38 [INFO] agent: (LAN) joined: 12 Err: <nil>
2017/11/09 11:03:38 [INFO] agent: Join LAN completed. Synced with 12 initial agents
2017/11/09 11:03:39 [INFO] agent: Synced service '_nomad-client-nomad-client-http'
2017/11/09 11:03:39 [INFO] agent: Synced service 'agent-2-staging.c.nextjournal-com.internal:dnsmasq:5353'
2017/11/09 11:03:39 [INFO] agent: Synced service 'agent-2-staging.c.nextjournal-com.internal:node-exporter:9100'
2017/11/09 11:03:39 [INFO] agent: Synced service 'agent-2-staging.c.nextjournal-com.internal:cadvisor:8080'
2017/11/09 11:03:43 [WARN] agent: Check '83133417dfb9c40c36f30ce4006ded28a45b0e60' is now critical
2017/11/09 11:03:53 [INFO] agent: Synced check '83133417dfb9c40c36f30ce4006ded28a45b0e60'
2017/11/09 11:05:07 [INFO] agent: Synced service 'agent-2-staging.c.nextjournal-com.internal:cadvisor:8080'
2017/11/09 11:05:07 [INFO] agent: Synced service 'agent-2-staging.c.nextjournal-com.internal:dnsmasq:5353'
2017/11/09 11:05:07 [INFO] agent: Synced service 'agent-2-staging.c.nextjournal-com.internal:node-exporter:9100'
2017/11/09 11:05:07 [INFO] agent: Synced check '83133417dfb9c40c36f30ce4006ded28a45b0e60'
2017/11/09 11:06:50 [INFO] agent: Synced check '83133417dfb9c40c36f30ce4006ded28a45b0e60'
Loaded configuration from /cluster_config/config/nomad/config.hcl, /agent.hcl
==> Starting Nomad agent...
==> Nomad agent configuration:
Client: true
Log Level: INFO
Region: global (DC: gc-europe-west1)
Server: false
Version: 0.7.0
==> Nomad agent started! Log data will stream in below:
2017/11/09 11:03:32.565699 [INFO] client: using state directory /var/lib/docker/volumes/nomad/_data/client
2017/11/09 11:03:32.568644 [INFO] client: using alloc directory /var/lib/docker/volumes/nomad/_data/alloc
2017/11/09 11:03:32.570026 [INFO] fingerprint.cgroups: cgroups are available
2017/11/09 11:03:32.577981 [INFO] fingerprint.consul: consul agent is available
2017/11/09 11:03:32.621005 [WARN] driver.raw_exec: raw exec is enabled. Only enable if needed
2017/11/09 11:03:32.623135 [INFO] client: Node ID "71505866-469a-0095-33e0-7f6eb9a17ef5"
2017/11/09 11:03:32.624395 [ERR] client.consul: error discovering nomad servers: client.consul: unable to query Consul datacenters: Unexpected response code: 500 (No known Consul servers)
2017/11/09 11:03:33.730826 [ERR] http: Request /v1/agent/health?type=client, error: {"client":{"ok":false,"message":"no known servers"}}
2017/11/09 11:03:43.732382 [ERR] http: Request /v1/agent/health?type=client, error: {"client":{"ok":false,"message":"no known servers"}}
2017/11/09 11:03:47.607893 [INFO] fingerprint.vault: Vault is available
2017/11/09 11:03:48.031128 [INFO] client.consul: discovered following Servers: 10.242.0.10:4647,10.242.0.11:4647,10.242.0.12:4647
2017/11/09 11:03:48.041272 [INFO] client: node registration complete
2017-11-09T11:04:02.586Z [DEBUG] plugin: starting plugin: path=/usr/local/bin/nomad args="[/usr/local/bin/nomad executor {"LogFile":"/var/lib/docker/volumes/nomad/_data/alloc/44ecd0a0-a064-c408-87aa-0946dbe092a5/prefetch/executor.out","LogLevel":"INFO"}]"
2017-11-09T11:04:02.587Z [DEBUG] plugin: waiting for RPC address: path=/usr/local/bin/nomad
2017-11-09T11:04:02.608Z [DEBUG] plugin.nomad: plugin address: timestamp=2017-11-09T11:04:02.608Z address=/tmp/plugin638569547 network=unix
2017/11/09 11:04:02.646730 [INFO] driver.docker: created container e78e6721bbe1a7685884b1750b2a0a1bcfa8b6afe4f84e2a8d2d4de9adfbc267
2017/11/09 11:04:03.025481 [INFO] driver.docker: started container e78e6721bbe1a7685884b1750b2a0a1bcfa8b6afe4f84e2a8d2d4de9adfbc267
2017/11/09 11:04:03.038865 [WARN] client: error fetching stats of task prefetch: stats collection hasn't started yet
2017-11-09T11:04:08.880Z [DEBUG] plugin: plugin process exited: path=/usr/local/bin/nomad
2017/11/09 11:04:08.890697 [INFO] client: task "prefetch" for alloc "44ecd0a0-a064-c408-87aa-0946dbe092a5" completed successfully
2017/11/09 11:04:08.890725 [INFO] client: Not restarting task: prefetch for alloc: 44ecd0a0-a064-c408-87aa-0946dbe092a5
2017/11/09 11:04:08.903704 [INFO] client: marking allocation 44ecd0a0-a064-c408-87aa-0946dbe092a5 for GC
nomad node-status -verbose 71505866-469a-0095-33e0-7f6eb9a17ef5 |grep vault
docker restart nomad
nomad node-status -verbose 71505866-469a-0095-33e0-7f6eb9a17ef5 |grep vault
vault.accessible = true
vault.cluster_id = 03a578ff-2020-9be8-6e1b-30c8027a37dd
vault.cluster_name = vault-cluster-031035f1
vault.version = 0.8.3
```
We can confirm that this looks like a Nomad <-> Consul parallel-start issue, combined with using vault.service.consul as the Vault address in the Nomad config.
We're now building our images with a static DNS config; we'll see if that helps.
EDIT: I can now confirm that after hardcoding a non-Consul DNS address for Vault, 7 out of 7 test instances booted with the correct vault attributes.
We found another workaround: wait for Consul to return a cluster leader before starting Nomad:
while ! curl -sf http://127.0.0.1:8500/v1/status/leader; do
  echo "$(date) Error while querying consul api for leader"
  sleep 1
done
exec /usr/local/bin/nomad agent -client ...
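One note on that loop (just a sketch, untested here, assuming Consul's HTTP API on 127.0.0.1:8500 as in the snippet above): /v1/status/leader can answer 200 with an empty leader address while the agent already knows its servers but no leader has been elected yet, so a slightly stricter variant also requires a non-empty address before starting Nomad:

```
#!/bin/sh
# Sketch only: wait until Consul reports a non-empty cluster leader before starting Nomad.
# /v1/status/leader returns 500 while no servers are known, and 200 with "" (an empty
# JSON string) while servers are known but no leader has been elected yet.
while true; do
  leader=$(curl -sf http://127.0.0.1:8500/v1/status/leader || true)
  if [ -n "$leader" ] && [ "$leader" != '""' ]; then
    echo "$(date) Consul leader is $leader"
    break
  fi
  echo "$(date) Waiting for Consul to elect a leader"
  sleep 1
done
# then start the client as before:
exec /usr/local/bin/nomad agent -client ...
```

Either way the idea is the same: don't start the Nomad client until the local Consul agent can actually answer queries.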
@hamann your version, while uglier, might be the better idea in the end.
When we used vault.our.tld, which had 3 A records, and restarted one of the Vault servers, it brought half of our cluster down into the initializing state, even though the 2 remaining Vault servers were healthy and unsealed. We had to unseal the restarted server for Nomad to stabilize, which is not very HA.
@hamann I have not been able to reproduce this issue. I used vault.service.consul as the vault config address. I set up dnsmasq to delegate .consul queries to Consul. I restarted both Nomad and Consul simultaneously. I also tried shutting down and restarting vault while Consul and Nomad were running.
In my testing, as long as dig vault.service.consul SRV returned SRV records indicating Vault was up on port 8200, nomad node-status -verbose returned vault.accessible=true. Only when Vault was sealed, or Consul was still restarting, did the vault information in the status go away, and it always came back once Vault was healthy and the DNS query worked. This confirms that the fingerprinter in Nomad reconnects to Vault every 15 seconds.
There may be other DNS caching factors at play here that are local to your environment. It could also be that your vault instance was being unsealed/initialized around the same time Nomad/Consul restarted.
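For anyone wanting to repeat those checks, this is roughly what they look like from a client node (a sketch; it assumes Consul's DNS interface on the default port 8600, dnsmasq forwarding *.consul queries to it, and $NODE_ID as a placeholder for the client's node ID):

```
# Query Consul's DNS interface directly for Vault's SRV record (default DNS port 8600;
# adjust the address/port to wherever your agent's DNS listener is bound).
dig @127.0.0.1 -p 8600 vault.service.consul SRV +short

# Query through the system resolver; this only works if dnsmasq forwards *.consul to Consul.
dig vault.service.consul SRV +short

# Check whether the Nomad client has picked up the Vault fingerprint.
nomad node-status -verbose "$NODE_ID" | grep vault
```

If the SRV query resolves but the vault.* attributes still never show up in node-status, that points at the registration timing discussed below rather than at DNS.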
@preetapan Thanks for your answer! Our setup is almost the same, with dnsmasq redirecting *.consul to the local Consul agent (127.0.0.1).
The Vault cluster was healthy at all times, but in our environment the local Consul agent sometimes takes a while (up to 80 seconds or even more) to discover the other Consul servers via GCE, due to an unstable network directly after boot (GCE requests time out because the network isn't yet correctly configured, packets get dropped, and so on). My workaround from above added stability. It's a bit tricky to build a test stack for that, but I'll try.
@hamann were you able to reproduce this again? I haven't been able to since my last comment, so at this point this doesn't look like a Nomad fingerprinter bug. I'll close this out; we can always reopen if it becomes reproducible again.
I've just encountered this issue while standing up a pair of new (Nomad 0.7.0, Consul agent 0.7.5) Nomad clusters. Our logging situation looks the same: nomad starts before consul has completely settled down, resulting in the vault attributes not being set in node-status even though the logs show that Vault became available prior to node registration.
One thing I noticed in trying to reproduce the issue is that once it successfully tells the server agent that Vault is available, subsequent reboots don't reproduce the issue. Unfortunately that has made further testing difficult.
We, too, look up Vault as vault.service.consul.
Nomad startup logs on a startup which exhibited the issue:
Dec 7 17:38:31 stg-nomadusw2-client04 nomad[2138]: client: using state directory /var/lib/nomad/client
Dec 7 17:38:31 stg-nomadusw2-client04 nomad[2138]: client: using alloc directory /var/lib/nomad/alloc
Dec 7 17:38:31 stg-nomadusw2-client04 nomad[2138]: fingerprint.cgroups: cgroups are available
Dec 7 17:38:31 stg-nomadusw2-client04 nomad[2138]: fingerprint.consul: consul agent is available
Dec 7 17:38:31 stg-nomadusw2-client04 nomad[2138]: fingerprint.network: Unable to parse Speed in output of '/sbin/ethtool eth0'
Dec 7 17:38:32 stg-nomadusw2-client04 nomad[2138]: client.consul: error discovering nomad servers: client.consul: unable to query Consul datacenters: Unexpected response code: 500 (No known Consul servers)
Dec 7 17:38:32 stg-nomadusw2-client04 nomad[2138]: client: Node ID "e167a706-151c-9e27-08d5-8d3ebec7fb4c"
Dec 7 17:38:32 stg-nomadusw2-client04 nomad[2138]: client: marking allocation 65bbd4bb-c51a-5db4-17cd-17fa96b1324e for GC
Dec 7 17:38:32 stg-nomadusw2-client04 nomad[2138]: client.consul: error discovering nomad servers: client.consul: unable to query Consul datacenters: Unexpected response code: 500 (No known Consul servers)
Dec 7 17:38:32 stg-nomadusw2-client04 nomad[2138]: client: marking allocation a5868170-8435-cccb-d5cb-5221e5fee09d for GC
Dec 7 17:38:32 stg-nomadusw2-client04 nomad[2138]: client: failed to update allocations: no servers
Dec 7 17:38:34 stg-nomadusw2-client04 nomad[2138]: http: Request /v1/agent/health?type=client, error: {"client":{"ok":false,"message":"no known servers"}}
Dec 7 17:38:39 stg-nomadusw2-client04 nomad[2138]: client: failed to update allocations: no servers
Dec 7 17:38:40 stg-nomadusw2-client04 nomad[2138]: http: Request /v1/node/e167a706-151c-9e27-08d5-8d3ebec7fb4c/allocations, error: no servers
Dec 7 17:38:45 stg-nomadusw2-client04 nomad[2138]: client: failed to update allocations: no servers
Dec 7 17:38:45 stg-nomadusw2-client04 nomad[2138]: http: Request /v1/agent/health?type=client, error: {"client":{"ok":false,"message":"no known servers"}}
Dec 7 17:38:53 stg-nomadusw2-client04 nomad[2138]: client: failed to update allocations: no servers
Dec 7 17:38:55 stg-nomadusw2-client04 nomad[2138]: http: Request /v1/agent/health?type=client, error: {"client":{"ok":false,"message":"no known servers"}}
Dec 7 17:39:00 stg-nomadusw2-client04 nomad[2138]: client: failed to update allocations: no servers
Dec 7 17:39:00 stg-nomadusw2-client04 nomad[2138]: http: Request /v1/node/e167a706-151c-9e27-08d5-8d3ebec7fb4c/allocations, error: no servers
Dec 7 17:39:05 stg-nomadusw2-client04 nomad[2138]: http: Request /v1/agent/health?type=client, error: {"client":{"ok":false,"message":"no known servers"}}
Dec 7 17:39:08 stg-nomadusw2-client04 nomad[2138]: client: failed to update allocations: no servers
Dec 7 17:39:15 stg-nomadusw2-client04 nomad[2138]: client: failed to update allocations: no servers
Dec 7 17:39:15 stg-nomadusw2-client04 nomad[2138]: http: Request /v1/agent/health?type=client, error: {"client":{"ok":false,"message":"no known servers"}}
Dec 7 17:39:22 stg-nomadusw2-client04 nomad[2138]: client: failed to update allocations: no servers
Dec 7 17:39:25 stg-nomadusw2-client04 nomad[2138]: http: Request /v1/agent/health?type=client, error: {"client":{"ok":false,"message":"no known servers"}}
Dec 7 17:39:31 stg-nomadusw2-client04 nomad[2138]: client: failed to update allocations: no servers
Dec 7 17:39:32 stg-nomadusw2-client04 nomad[2138]: fingerprint.vault: Vault is available
Dec 7 17:39:35 stg-nomadusw2-client04 nomad[2138]: http: Request /v1/agent/health?type=client, error: {"client":{"ok":false,"message":"no known servers"}}
Dec 7 17:39:39 stg-nomadusw2-client04 nomad[2138]: client: failed to update allocations: no servers
Dec 7 17:39:40 stg-nomadusw2-client04 nomad[2138]: client.consul: discovered following Servers: 54.190.194.158:4647,54.212.193.32:4647,34.210.84.10:4647
Dec 7 17:39:50 stg-nomadusw2-client04 nomad[2138]: client: node registration complete
And again on a restart on a running host:
Dec 7 17:59:57 stg-nomadusw2-client04 nomad[16072]: client: using state directory /var/lib/nomad/client
Dec 7 17:59:57 stg-nomadusw2-client04 nomad[16072]: client: using alloc directory /var/lib/nomad/alloc
Dec 7 17:59:57 stg-nomadusw2-client04 nomad[16072]: fingerprint.cgroups: cgroups are available
Dec 7 17:59:57 stg-nomadusw2-client04 nomad[16072]: fingerprint.consul: consul agent is available
Dec 7 17:59:57 stg-nomadusw2-client04 nomad[16072]: fingerprint.network: Unable to parse Speed in output of '/sbin/ethtool eth0'
Dec 7 17:59:57 stg-nomadusw2-client04 nomad[16072]: fingerprint.vault: Vault is available
Dec 7 17:59:57 stg-nomadusw2-client04 nomad[16072]: client: Node ID "e167a706-151c-9e27-08d5-8d3ebec7fb4c"
Dec 7 17:59:57 stg-nomadusw2-client04 nomad[16072]: client: marking allocation 65bbd4bb-c51a-5db4-17cd-17fa96b1324e for GC
Dec 7 17:59:57 stg-nomadusw2-client04 nomad[16072]: client: marking allocation a5868170-8435-cccb-d5cb-5221e5fee09d for GC
Dec 7 17:59:57 stg-nomadusw2-client04 nomad[16072]: client.consul: discovered following Servers: 54.190.194.158:4647,54.212.193.32:4647,34.210.84.10:4647
Dec 7 17:59:57 stg-nomadusw2-client04 nomad[16072]: client: marking allocation 65bbd4bb-c51a-5db4-17cd-17fa96b1324e for GC
Dec 7 17:59:57 stg-nomadusw2-client04 nomad[16072]: client: marking allocation a5868170-8435-cccb-d5cb-5221e5fee09d for GC
Dec 7 17:59:57 stg-nomadusw2-client04 nomad[16072]: client: garbage collecting allocation a5868170-8435-cccb-d5cb-5221e5fee09d due to forced collection
Dec 7 17:59:57 stg-nomadusw2-client04 nomad[16072]: client: garbage collecting allocation 65bbd4bb-c51a-5db4-17cd-17fa96b1324e due to forced collection
Dec 7 17:59:57 stg-nomadusw2-client04 nomad[16072]: client: node registration complete
@wlonkly Did your node get updated to indicate Vault was available after:
Dec 7 17:39:32 stg-nomadusw2-client04 nomad[2138]: fingerprint.vault: Vault is available
@dadgar It didn't get updated, not until the restart. (I was surprised by this, since the node registration happened after vault became available.)
Realized I could reproduce this by waiting for nomad node-status to show the node as down before starting Nomad again. And once again I got the following timing on startup:
Dec 7 18:51:31 stg-nomadusw2-client04 nomad[2151]: http: Request /v1/agent/health?type=client, error: {"client":{"ok":false,"message":"no known servers"}}
Dec 7 18:51:35 stg-nomadusw2-client04 nomad[2151]: fingerprint.vault: Vault is available
Dec 7 18:51:41 stg-nomadusw2-client04 nomad[2151]: http: Request /v1/agent/health?type=client, error: {"client":{"ok":false,"message":"no known servers"}}
Dec 7 18:51:43 stg-nomadusw2-client04 nomad[2151]: client.consul: discovered following Servers: 54.190.194.158:4647,54.212.193.32:4647,34.210.84.10:4647
Dec 7 18:51:53 stg-nomadusw2-client04 nomad[2151]: client: node registration complete
but no vault attributes on the node.
I just got a sequence which I found interesting:
Dec 7 18:59:56 stg-nomadusw2-client04 nomad[2120]: client.consul: error discovering nomad servers: client.consul: unable to query Consul datacenters: Unexpected response code: 500 (No known Consul servers)
[...]
Dec 7 19:00:45 stg-nomadusw2-client04 nomad[2120]: client.consul: discovered following Servers: 54.190.194.158:4647,54.212.193.32:4647,34.210.84.10:4647
Dec 7 19:00:48 stg-nomadusw2-client04 nomad[2120]: client: node registration complete
Dec 7 19:00:57 stg-nomadusw2-client04 nomad[2120]: fingerprint.vault: Vault is available
Dec 7 19:01:01 stg-nomadusw2-client04 nomad[2120]: client: node registration complete
In this case, the _second_ node registration after vault became available _did_ register the vault attributes.
@wlonkly Would you be able to test a build if I gave it to you? What are you running, Linux AMD64?
@dadgar Yep, timing is perfect, there's zero workload on these new clusters. Linux amd64.
@wlonkly This should fix it! Would appreciate it if you could run it through your repro case a few times and report back!
nomad.zip
@dadgar After several runs, I'm reliably seeing a second node registration at the same time as the first one, and the vault attributes are available following the two node registrations.
Dec 7 19:47:47 stg-nomadusw2-client04 nomad[7752]: client: using state directory /var/lib/nomad/client
Dec 7 19:47:47 stg-nomadusw2-client04 nomad[7752]: client: using alloc directory /var/lib/nomad/alloc
Dec 7 19:47:47 stg-nomadusw2-client04 nomad[7752]: fingerprint.cgroups: cgroups are available
Dec 7 19:47:47 stg-nomadusw2-client04 nomad[7752]: fingerprint.consul: consul agent is available
Dec 7 19:47:47 stg-nomadusw2-client04 nomad[7752]: fingerprint.network: Unable to parse Speed in output of '/sbin/ethtool eth0'
Dec 7 19:47:47 stg-nomadusw2-client04 nomad[7752]: client: Node ID "e167a706-151c-9e27-08d5-8d3ebec7fb4c"
Dec 7 19:47:47 stg-nomadusw2-client04 nomad[7752]: client.consul: error discovering nomad servers: client.consul: unable to query Consul datacenters: Unexpected response code: 500 (No known Consul servers)
Dec 7 19:47:47 stg-nomadusw2-client04 nomad[7752]: client.consul: error discovering nomad servers: client.consul: unable to query Consul datacenters: Unexpected response code: 500 (No known Consul servers)
Dec 7 19:47:57 stg-nomadusw2-client04 nomad[7752]: http: Request /v1/agent/health?type=client, error: {"client":{"ok":false,"message":"no known servers"}}
Dec 7 19:48:05 stg-nomadusw2-client04 nomad[7752]: fingerprint.vault: Vault is available
Dec 7 19:48:07 stg-nomadusw2-client04 nomad[7752]: http: Request /v1/agent/health?type=client, error: {"client":{"ok":false,"message":"no known servers"}}
Dec 7 19:48:08 stg-nomadusw2-client04 nomad[7752]: client.consul: discovered following Servers: 54.190.194.158:4647,54.212.193.32:4647,34.210.84.10:4647
Dec 7 19:48:11 stg-nomadusw2-client04 nomad[7752]: client: node registration complete
Dec 7 19:48:11 stg-nomadusw2-client04 nomad[7752]: client: node registration complete