Nomad v0.9.3 (c5e8b66c3789e4e7f9a83b4e188e9a937eea43ce)
Server: Ubuntu 18.04
Client: Windows 2019
Driver: Docker
There don't appear to be any allocation resource usage metrics exposed after adding the following telemetry block to both the server and client configurations:
telemetry {
  collection_interval        = "1s"
  disable_hostname           = true
  prometheus_metrics         = true
  publish_allocation_metrics = true
  publish_node_metrics       = true
}
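As a sanity check that the block is being picked up at all, the same endpoint can also be asked for Prometheus-format output (a minimal example, assuming the default bind address and port):
$ curl --silent "http://127.0.0.1:4646/v1/metrics?format=prometheus" | head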
I've tried querying both the server and the clients: there don't appear to be any nomad.client metrics exposed on the server, and on the clients there are only basic totals, with none of the allocation metrics listed at https://www.nomadproject.io/docs/telemetry/metrics.html#allocation-metrics.
$ curl --silent https://x.x.x.x:4646/v1/metrics | jq '.Gauges[].Name' | uniq
"nomad.nomad.autopilot.failure_tolerance"
"nomad.nomad.autopilot.healthy"
"nomad.nomad.blocked_evals.total_blocked"
"nomad.nomad.blocked_evals.total_escaped"
"nomad.nomad.blocked_evals.total_quota_limit"
"nomad.nomad.broker._core.ready"
"nomad.nomad.broker._core.unacked"
"nomad.nomad.broker.service.ready"
"nomad.nomad.broker.service.unacked"
"nomad.nomad.broker.total_blocked"
"nomad.nomad.broker.total_ready"
"nomad.nomad.broker.total_unacked"
"nomad.nomad.broker.total_waiting"
"nomad.nomad.heartbeat.active"
"nomad.nomad.job_summary.complete"
"nomad.nomad.job_summary.failed"
"nomad.nomad.job_summary.lost"
"nomad.nomad.job_summary.queued"
"nomad.nomad.job_summary.running"
"nomad.nomad.job_summary.starting"
"nomad.nomad.plan.queue_depth"
"nomad.nomad.vault.distributed_tokens_revoking"
"nomad.nomad.vault.token_ttl"
"nomad.runtime.alloc_bytes"
"nomad.runtime.free_count"
"nomad.runtime.heap_objects"
"nomad.runtime.malloc_count"
"nomad.runtime.num_goroutines"
"nomad.runtime.sys_bytes"
"nomad.runtime.total_gc_pause_ns"
"nomad.runtime.total_gc_runs"
$ curl --silent http://x.x.x.x:4646/v1/metrics | jq '.Gauges[].Name'
"nomad.client.allocated.cpu"
"nomad.client.allocated.disk"
"nomad.client.allocated.memory"
"nomad.client.allocated.network"
"nomad.client.allocations.blocked"
"nomad.client.allocations.migrating"
"nomad.client.allocations.pending"
"nomad.client.allocations.running"
"nomad.client.allocations.terminal"
"nomad.client.unallocated.cpu"
"nomad.client.unallocated.disk"
"nomad.client.unallocated.memory"
"nomad.client.unallocated.network"
"nomad.runtime.alloc_bytes"
"nomad.runtime.free_count"
"nomad.runtime.heap_objects"
"nomad.runtime.malloc_count"
"nomad.runtime.num_goroutines"
"nomad.runtime.sys_bytes"
"nomad.runtime.total_gc_pause_ns"
"nomad.runtime.total_gc_runs"
Are there any jobs registered with running allocations on the client that you're talking to?
For example:
$ http localhost:7646/v1/metrics | jq '.Gauges[]| select(.Name | startswith("nomad.client.allocs.cpu")) | .Name, .Labels.alloc_id'
"nomad.client.allocs.cpu.system"
"ed6849f8-877e-4c42-f36a-1653206d4266"
"nomad.client.allocs.cpu.throttled_periods"
"ed6849f8-877e-4c42-f36a-1653206d4266"
"nomad.client.allocs.cpu.throttled_time"
"ed6849f8-877e-4c42-f36a-1653206d4266"
"nomad.client.allocs.cpu.total_percent"
"ed6849f8-877e-4c42-f36a-1653206d4266"
"nomad.client.allocs.cpu.total_ticks"
"ed6849f8-877e-4c42-f36a-1653206d4266"
"nomad.client.allocs.cpu.user"
"ed6849f8-877e-4c42-f36a-1653206d4266"
@cgbaker
Running the following to hit all 6 clients yields no results:
$ for clientAddr in x x x x x x; do curl --silent http://x.x.x.${clientAddr}:4646/v1/metrics | jq '.Gauges[]| select(.Name | startswith("nomad.client.allocs.cpu")) | .Name, .Labels.alloc_id'; done;
There are certainly running allocations on all of the clients that are up:
$ nomad node status -allocs
ID        DC              Name        Class   Drain  Eligibility  Status  Running Allocs
9e144753  europe-west2-b  nomad-g42v  <none>  false  eligible     ready   6
30a68744  europe-west2-c  nomad-gmk2  <none>  false  eligible     ready   25
f6f5e1dd  europe-west2-c  nomad-d48d  <none>  false  eligible     ready   27
92def6f4  europe-west2-b  nomad-rfz4  <none>  false  eligible     ready   24
5f7c3e2e  europe-west2-b  nomad-g42v  <none>  false  eligible     down    0
6963fbfe  europe-west2-a  nomad-bw8h  <none>  false  eligible     ready   24
444905c6  europe-west2-a  nomad-0g3x  <none>  false  eligible     ready   27
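A single allocation can also be spot-checked like this (the node ID is from the table above; the alloc ID is a placeholder to be filled in from the first command's output):
$ nomad node status 9e144753    # lists the allocations placed on this node
$ curl --silent http://x.x.x.x:4646/v1/metrics | jq --arg alloc "<alloc-id-from-above>" '.Gauges[] | select(.Labels.alloc_id == $alloc) | .Name'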
@damoxc, there is a reported issue where, under load on the client node, all of the nomad.client.* metrics go missing due to (as I recall) a blocking call to a Windows API that never returns. I did some trivial testing on a single-node Windows cluster and the allocation metrics were returned, so I'm working under the theory that some circumstantial difference is needed to reproduce this issue. Any insight you have into reproducing it would be appreciated.
@cgbaker are you able to share the configuration files for your simple example so I could compare them to what I have?
They weren't saved before the node was torn down, but the telemetry section was the same as you posted above. I will spin up a cluster and try again.
I've added the same telemetry block to our production cluster, which is configured nearly identically, and it is exhibiting the same problem; so at least the behaviour is consistent and not just something random with our dev cluster.
And to be clear, the prod cluster has the same configuration:
Yes. We also have:
We're also seeing this affecting our Nomad deployment with Ubuntu Linux 16.04 clients and Nomad v0.9.2
@damoxc @mgeggie Could you share any example job files that you're not seeing metrics for?
@endocrimes We've actually had jobs start out successfully reporting metrics, only to later have all allocation usage metrics stop on a nomad-client server. Here's an example job we use for deploying traefik:
job "traefik" {
datacenters = ["use1"]
type = "service"
group "traefik" {
count = 3
update {
max_parallel = 1
min_healthy_time = "20s"
healthy_deadline = "2m"
auto_revert = true
stagger = "30s"
}
# migrate {
# max_parallel = 1
# health_check = "checks"
# min_healthy_time = "20s"
# healthy_deadline = "1m"
# }
task "traefik" {
driver = "docker"
kill_timeout = "35s"
env {
CONSUL_HTTP_ADDR = "http://169.254.1.1:8500"
}
config {
image = "traefik:v1.7.11"
args = [
"--api",
"--ping",
"--ping.entrypoint=http",
"--consulcatalog.endpoint=169.254.1.1:8500",
"--metrics.prometheus.entrypoint=traefik",
"--traefikLog.format=common",
"--accessLog.format=common",
"--lifecycle.requestacceptgracetimeout=20s",
"--lifecycle.gracetimeout=10s",
]
port_map {
http = 80
webui = 8080
}
}
resources {
cpu = 1000 # MHz
memory = 2048 # MB
network {
mbits = 500
port "http" {
static = 80
}
port "webui" { }
}
}
service {
name = "traefik"
port = "webui"
check {
name = "ping"
type = "http"
port = "http"
path = "/ping"
interval = "5s"
timeout = "2s"
}
}
}
}
}
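A quick way to see which instances are reporting is to count allocation gauges per alloc ID on each client; a rough sketch against the same /v1/metrics endpoint as in the earlier examples:
$ curl --silent http://x.x.x.x:4646/v1/metrics | jq -r '.Gauges[] | select(.Name | startswith("nomad.client.allocs")) | .Labels.alloc_id' | sort | uniq -c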
In fact, for this job we have 3 instances, each on its own Nomad client server; two are currently reporting metrics and one is not. None of the allocations on the affected Nomad client server are reporting allocation utilization metrics.
@mgeggie That's very useful - thank you!
Could you possibly send us any client logs from the bad clients to [email protected]? The lower-level the logging, the better, but for this one anything is useful.
@endocrimes I sent logs over to [email protected].
@mgeggie Thanks! I think #6349 should help with a fair amount of your case, as it looks like the broader client allocation metrics are getting blocked on collecting host disk stats. I'm still unsure about the individual allocation metrics, though.
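A quick way to check whether the host-level gauges are making it out at all (the nomad.client.host.* names are on the telemetry metrics page linked earlier) would be something like:
$ curl --silent http://x.x.x.x:4646/v1/metrics | jq -r '.Gauges[] | select(.Name | startswith("nomad.client.host")) | .Name' | sort -u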
Thanks @endocrimes . We've seen that host disk collection error since starting our Nomad cluster a few months back. I'll check out #6349 to see about resolving the issue.
Also note: we've restarted the Nomad process on our afflicted Nomad client, and the allocation stats that had been reporting and then stopped did not recover after the restart.
We're about to upgrade our Nomad cluster to 0.9.5. I'll report back on the status of our afflicted nodes once that's complete.
Hi @endocrimes, just an update on our upgrade: after upgrading our troublesome Nomad client to 0.9.5 and rebooting the server, allocation resource usage metrics are once again being produced.
Hey there
Since this issue hasn't had any activity in a while - we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at this.
Thanks!
Hey @tgross, I only received the waiting-reply label notification and the issue was closed within the same hour, and I'm still curious how our cluster entered a state where it stopped producing allocation metrics. I can report that in the 3 months since upgrading to Nomad 0.9.5 we haven't had any further issues with losing allocation metrics.
Hi @mgeggie, sorry about the confusion there. I saw the notification from the bot, and it looked to me like the issue had been resolved with the upgrade?
I know we updated the prometheus and libcontainer clients in 0.9.4, so the issue may have been upstream; but we also improved CPU utilization for busy clusters in that same release, and we have seen Nomad CPU utilization impact metrics collection for other users. I've recently added extra testing for both host and allocation metrics collection in our end-to-end test suite, and I have an open ticket to upgrade our go-psutils library in the 0.11 release cycle so that we pick up upstream improvements in collecting host information. Hope that gives you some context!