Nomad v0.9.3 (c5e8b66c3789e4e7f9a83b4e188e9a937eea43ce)
Server: Ubuntu 18.04
Client: Windows 2019
Driver: Docker
There don't appear to be any allocation resource usage metrics exposed after adding the following telemetry block to both the server and client configurations:
telemetry {
  collection_interval        = "1s"
  disable_hostname           = true
  prometheus_metrics         = true
  publish_allocation_metrics = true
  publish_node_metrics       = true
}
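As a sanity check that the block is being picked up at all, the same endpoint can also be asked for Prometheus-format output (a minimal example, assuming the default bind address and port):
$ curl --silent "http://127.0.0.1:4646/v1/metrics?format=prometheus" | head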
I've tried querying both the server and the clients: there don't appear to be any nomad.client metrics exposed on the server, and on the clients there are only basic totals, with none of the allocation metrics listed at https://www.nomadproject.io/docs/telemetry/metrics.html#allocation-metrics.
$ curl --silent https://x.x.x.x:4646/v1/metrics | jq '.Gauges[].Name' | uniq
"nomad.nomad.autopilot.failure_tolerance"
"nomad.nomad.autopilot.healthy"
"nomad.nomad.blocked_evals.total_blocked"
"nomad.nomad.blocked_evals.total_escaped"
"nomad.nomad.blocked_evals.total_quota_limit"
"nomad.nomad.broker._core.ready"
"nomad.nomad.broker._core.unacked"
"nomad.nomad.broker.service.ready"
"nomad.nomad.broker.service.unacked"
"nomad.nomad.broker.total_blocked"
"nomad.nomad.broker.total_ready"
"nomad.nomad.broker.total_unacked"
"nomad.nomad.broker.total_waiting"
"nomad.nomad.heartbeat.active"
"nomad.nomad.job_summary.complete"
"nomad.nomad.job_summary.failed"
"nomad.nomad.job_summary.lost"
"nomad.nomad.job_summary.queued"
"nomad.nomad.job_summary.running"
"nomad.nomad.job_summary.starting"
"nomad.nomad.plan.queue_depth"
"nomad.nomad.vault.distributed_tokens_revoking"
"nomad.nomad.vault.token_ttl"
"nomad.runtime.alloc_bytes"
"nomad.runtime.free_count"
"nomad.runtime.heap_objects"
"nomad.runtime.malloc_count"
"nomad.runtime.num_goroutines"
"nomad.runtime.sys_bytes"
"nomad.runtime.total_gc_pause_ns"
"nomad.runtime.total_gc_runs"
$ curl --silent http://x.x.x.x:4646/v1/metrics | jq '.Gauges[].Name'
"nomad.client.allocated.cpu"
"nomad.client.allocated.disk"
"nomad.client.allocated.memory"
"nomad.client.allocated.network"
"nomad.client.allocations.blocked"
"nomad.client.allocations.migrating"
"nomad.client.allocations.pending"
"nomad.client.allocations.running"
"nomad.client.allocations.terminal"
"nomad.client.unallocated.cpu"
"nomad.client.unallocated.disk"
"nomad.client.unallocated.memory"
"nomad.client.unallocated.network"
"nomad.runtime.alloc_bytes"
"nomad.runtime.free_count"
"nomad.runtime.heap_objects"
"nomad.runtime.malloc_count"
"nomad.runtime.num_goroutines"
"nomad.runtime.sys_bytes"
"nomad.runtime.total_gc_pause_ns"
"nomad.runtime.total_gc_runs"
Are there any jobs registered with running allocations on the client that you're talking to?
For example:
$ http localhost:7646/v1/metrics | jq '.Gauges[]| select(.Name | startswith("nomad.client.allocs.cpu")) | .Name, .Labels.alloc_id'
"nomad.client.allocs.cpu.system"
"ed6849f8-877e-4c42-f36a-1653206d4266"
"nomad.client.allocs.cpu.throttled_periods"
"ed6849f8-877e-4c42-f36a-1653206d4266"
"nomad.client.allocs.cpu.throttled_time"
"ed6849f8-877e-4c42-f36a-1653206d4266"
"nomad.client.allocs.cpu.total_percent"
"ed6849f8-877e-4c42-f36a-1653206d4266"
"nomad.client.allocs.cpu.total_ticks"
"ed6849f8-877e-4c42-f36a-1653206d4266"
"nomad.client.allocs.cpu.user"
"ed6849f8-877e-4c42-f36a-1653206d4266"
@cgbaker
Running the following to hit all 6 clients yields no results:
$ for clientAddr in x x x x x x; do curl --silent http://x.x.x.${clientAddr}:4646/v1/metrics | jq '.Gauges[]| select(.Name | startswith("nomad.client.allocs.cpu")) | .Name, .Labels.alloc_id'; done;
There are certainly running allocations on all of the clients that are up:
$ nomad node status -allocs
ID        DC              Name        Class   Drain  Eligibility  Status  Running Allocs
9e144753  europe-west2-b  nomad-g42v  <none>  false  eligible     ready   6
30a68744  europe-west2-c  nomad-gmk2  <none>  false  eligible     ready   25
f6f5e1dd  europe-west2-c  nomad-d48d  <none>  false  eligible     ready   27
92def6f4  europe-west2-b  nomad-rfz4  <none>  false  eligible     ready   24
5f7c3e2e  europe-west2-b  nomad-g42v  <none>  false  eligible     down    0
6963fbfe  europe-west2-a  nomad-bw8h  <none>  false  eligible     ready   24
444905c6  europe-west2-a  nomad-0g3x  <none>  false  eligible     ready   27
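A single allocation can also be spot-checked like this (the node ID is from the table above; the alloc ID is a placeholder to be filled in from the first command's output):
$ nomad node status 9e144753    # lists the allocations placed on this node
$ curl --silent http://x.x.x.x:4646/v1/metrics | jq --arg alloc "<alloc-id-from-above>" '.Gauges[] | select(.Labels.alloc_id == $alloc) | .Name'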
@damoxc, there is a reported issue where, under load on the client node, all of the nomad.client.* metrics go missing due to (as I recall) a blocking call to a Windows API that never returns. I did some trivial testing on a single-node Windows cluster and the allocation metrics were returned, so I'm working under the theory that some circumstantial difference is needed to reproduce this issue. Any insight you have into reproducing it would be appreciated.
@cgbaker are you able to share the configuration files for your simple example so I could compare them to what I have?
They weren't saved before the node was torn down, but the telemetry section was the same as you posted above. I will spin up a cluster and try again.
I've added the same telemetry block to our production cluster, which is configured nearly identically, and it is exhibiting the same problem; so at least the behaviour is consistent and not just something random with our dev cluster.
And to be clear, the prod cluster has the same configuration:
Yes. We also have:
We're also seeing this affecting our Nomad deployment with Ubuntu Linux 16.04 clients and Nomad v0.9.2
@damoxc @mgeggie Could you share any example job files that you're not seeing metrics for?
@endocrimes We've actually had jobs start out successfully reporting metrics, only to later have all allocation usage metrics stop on a nomad-client server. Here's an example job we use for deploying traefik:
job "traefik" {
datacenters = ["use1"]
type = "service"
group "traefik" {
count = 3
update {
max_parallel = 1
min_healthy_time = "20s"
healthy_deadline = "2m"
auto_revert = true
stagger = "30s"
}
# migrate {
# max_parallel = 1
# health_check = "checks"
# min_healthy_time = "20s"
# healthy_deadline = "1m"
# }
task "traefik" {
driver = "docker"
kill_timeout = "35s"
env {
CONSUL_HTTP_ADDR = "http://169.254.1.1:8500"
}
config {
image = "traefik:v1.7.11"
args = [
"--api",
"--ping",
"--ping.entrypoint=http",
"--consulcatalog.endpoint=169.254.1.1:8500",
"--metrics.prometheus.entrypoint=traefik",
"--traefikLog.format=common",
"--accessLog.format=common",
"--lifecycle.requestacceptgracetimeout=20s",
"--lifecycle.gracetimeout=10s",
]
port_map {
http = 80
webui = 8080
}
}
resources {
cpu = 1000 # MHz
memory = 2048 # MB
network {
mbits = 500
port "http" {
static = 80
}
port "webui" { }
}
}
service {
name = "traefik"
port = "webui"
check {
name = "ping"
type = "http"
port = "http"
path = "/ping"
interval = "5s"
timeout = "2s"
}
}
}
}
}
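A quick way to see which instances are reporting is to count allocation gauges per alloc ID on each client; a rough sketch against the same /v1/metrics endpoint as in the earlier examples:
$ curl --silent http://x.x.x.x:4646/v1/metrics | jq -r '.Gauges[] | select(.Name | startswith("nomad.client.allocs")) | .Labels.alloc_id' | sort | uniq -c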
In fact, for this job we have 3 instances, each on its own Nomad client server; two are currently reporting metrics and one is not. None of the allocations on the affected Nomad client server are reporting allocation utilization metrics.
@mgeggie That's very useful - thank you!
Could you possibly send us any client logs from the bad clients to [email protected]? The lower-level the logging, the better, but for this one anything is useful.
@endocrimes I sent logs over to [email protected].
@mgeggie Thanks! I think #6349 should help with a fair amount of your case, as it looks like the broader client allocation metrics are getting blocked on collecting host disk stats. I'm still unsure about the individual allocation metrics, though.
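A quick way to check whether the host-level gauges are making it out at all (the nomad.client.host.* names are on the telemetry metrics page linked earlier) would be something like:
$ curl --silent http://x.x.x.x:4646/v1/metrics | jq -r '.Gauges[] | select(.Name | startswith("nomad.client.host")) | .Name' | sort -u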
Thanks @endocrimes . We've seen that host disk collection error since starting our Nomad cluster a few months back. I'll check out #6349 to see about resolving the issue.
Also note: we've restarted the Nomad process on our afflicted Nomad client, and the allocation stats that had been reporting and then stopped did not recover after the restart.
We're about to upgrade our Nomad cluster to 0.9.5. I'll report back on the status of our afflicted nodes once that's complete.
Hi @endocrimes, just an update on our upgrade: after upgrading our troublesome Nomad client to 0.9.5 and rebooting the server, allocation resource usage metrics are once again being produced.
Hey there
Since this issue hasn't had any activity in a while - we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at this.
Thanks!
Hey @tgross, I only received the waiting-reply label notification and the issue was closed within the same hour, and I'm still curious how our cluster entered a state where it stopped producing allocation metrics. I can report that in the 3 months since upgrading to Nomad 0.9.5 we haven't had any further issues with losing allocation metrics.
Hi @mgeggie, sorry about the confusion there. I saw the notification from the bot, and it looked to me like the issue had been resolved with the upgrade?
I know we updated the prometheus and libcontainer clients in 0.9.4, so the issue may have been upstream; but we also improved CPU utilization for busy clusters in that same release, and we have seen Nomad CPU utilization impact metrics collection for other users. I've recently added extra testing for both host and allocation metrics collection in our end-to-end test suite, and I have an open ticket to upgrade our go-psutils library in the 0.11 release cycle so that we pick up upstream improvements in collecting host information. Hope that gives you some context!