Hey everyone,
I'll preface this by saying that the issue is intermittent: sometimes nomad alloc logs returns the logs, and sometimes the call hangs. It happens on running, failed, and completed allocations alike.
When calling nomad alloc logs on an allocation, the call hangs. The same happens in the Web UI, where both the direct call to the node and the call through the master fail.
One thing to note is that logs are available on the filesystem of the clients and can be read without any problems.
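As a sketch of what reading them from the filesystem looks like, the path below assumes Nomad's default on-disk layout under data_dir (which is /tmp/nomad in the config further down); the alloc ID and task name are illustrative placeholders taken from later in the thread:

```shell
# Reading task logs straight off a client's disk (assumes the default
# data_dir layout; alloc ID and task name are placeholders).
DATA_DIR=/tmp/nomad
ALLOC_ID=2ba31be3-db99-d243-7aea-27f41651b010
TASK=myservice
LOG_FILE="${DATA_DIR}/alloc/${ALLOC_ID}/alloc/logs/${TASK}.stdout.0"
echo "$LOG_FILE"
# then: cat "$LOG_FILE"
```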
I'm not sure whether this is a question or a bug report, as I can't reproduce the issue on other versions of Nomad (0.9.1).
$ nomad alloc logs -job myjob
It doesn't seem dependent on the job, since it sometimes works for one allocation of a job and not for another.
consul {
  # The address to the Consul agent.
  address = "127.0.0.1:8500"
  # The service name to register the server and client with Consul.
  server_service_name = "nomad"
  client_service_name = "nomadclient"
  # Enables automatically registering the services.
  auto_advertise = true
  # Enabling the server and client to bootstrap using Consul.
  server_auto_join = true
  client_auto_join = true
}

tls {
  http = true
  rpc = true
  ca_file = "/cert/cert.crt"
  cert_file = "/etc/ssl/server.eu-central-1.nomad/cert.pem"
  key_file = "/etc/ssl/server.eu-central-1.nomad/key.pem"
  verify_server_hostname = true
}

telemetry {
  publish_allocation_metrics = true
  publish_node_metrics = true
  datadog_address = "127.0.0.1:8125"
  disable_hostname = true
}

enable_syslog = true

acl {
  enabled = true
  token_ttl = "30s"
  policy_ttl = "60s"
}
datacenter = "datacenter"
data_dir = "/tmp/nomad"
region = "eu-central-1"
server {
  enabled = true
  bootstrap_expect = 3
  encrypt = "xxxxxxxx"
  job_gc_threshold = "12h"
  eval_gc_threshold = "6h"
  deployment_gc_threshold = "6h"
  authoritative_region = "eu-central-1"
}

vault {
  enabled = true
  address = "https://vault.service.consul"
  token = "XXXXXX"
  create_from_role = "system_nomad"
}
datacenter = "datacenter"
data_dir = "/tmp/nomad"
region = "eu-central-1"
client {
  enabled = true
  meta = {
  }
  options = {
    "docker.auth.config" = "/root/.docker/config.json"
    "docker.caps.whitelist" = "CHOWN,DAC_OVERRIDE,FSETID,FOWNER,MKNOD,NET_RAW,SETGID,SETUID,SETFCAP,SETPCAP,NET_BIND_SERVICE,SYS_CHROOT,KILL,AUDIT_WRITE,NET_ADMIN"
  }
  gc_interval = "10m"
  gc_max_allocs = 100
}

advertise {
}

vault {
  enabled = true
  address = "https://vault.service.consul"
}
Couldn't find anything meaningful in Nomad logs.
In the browser I see:
nomad-ui-c53fd732837d85cbd816ec376e86c4f7.js:1164 LOG FETCH: Couldn't connect to //10.102.3.22:4646/v1/client/fs/logs/2ba31be3-db99-d243-7aea-27f41651b010?follow=true&offset=50000&origin=end&task=myservice&type=stdout
(anonymous) @ nomad-ui-c53fd732837d85cbd816ec376e86c4f7.js:1164
nomad-ui-c53fd732837d85cbd816ec376e86c4f7.js:1164 LOG FETCH: Couldn't connect to /v1/client/fs/logs/2ba31be3-db99-d243-7aea-27f41651b010?follow=true&offset=50000&origin=end&task=myservice&type=stdout
The first error is expected: the client nodes don't have certificates that correspond to their IPs.
The second call, made directly to the master (which then proxies to the client), is the one that is flaky for us. Sometimes refreshing is enough to get the logs (in the UI), as is making another nomad logs call; other times, the call just hangs (through curl or nomad logs).
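For reference, the flaky call can be exercised outside the UI with a plain HTTP request to the same endpoint. This is only a sketch: nomad-server.example.com is a placeholder for one of the servers, and the alloc ID and task name are the ones from the UI errors above. Bounding the request with a timeout makes a hang show up as a failure instead of blocking forever:

```shell
# Hypothetical reproduction: hit the logs endpoint through a server with a
# bounded timeout, so a hang surfaces as a timeout rather than blocking.
# nomad-server.example.com is a placeholder address.
out=$(curl --silent --max-time 10 --insecure \
  "https://nomad-server.example.com:4646/v1/client/fs/logs/2ba31be3-db99-d243-7aea-27f41651b010?task=myservice&type=stdout&origin=end&offset=50000" \
  || echo "request timed out or failed")
echo "$out"
```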
I've been trying to figure it out for a few days at this point and can't find anything that could indicate what's happening.
Thanks
What happens if you run the CLI command on the client node where the allocation was scheduled (i.e., where the logs are still present on the file system)?
$ nomad alloc logs -job myjob
We experience the same behavior; the logs are working intermittently.
Hey @cgbaker
From the node, I was able to consistently get the logs:
$ NOMAD_ADDR=https://localhost:4646 NOMAD_SKIP_VERIFY=1 nomad logs -stderr $myAlloc
From my machine, pointing NOMAD_ADDR at the masters, Nomad was still only displaying the logs intermittently (roughly once every 5 to 10 tries).
We have similar problems with both nomad alloc logs and nomad alloc exec. Typically they work fine on the node with the allocation and do not work from other nodes. We didn't see this behavior in Nomad 0.9.0, but we started seeing it in 0.9.2 and it persists in 0.9.3. Preetha asked for log/conf/job files, but it looks like you beat me to getting them transferred off of our private network and out to the internet at large. Thanks for the thorough documentation and bug report.
Here is the example that @onomojaku referenced:
The alloc 3cdea42e is running on NOMAD_CLIENT_1. The command nomad alloc exec succeeds on NOMAD_CLIENT_1:
NOMAD_CLIENT_1 > nomad alloc exec 3cdea42e bash
root@3cdea42e:/#
Here are the logs:
NOMAD_CLIENT_1 > sudo /bin/cat /var/log/messages | grep 3cdea42e
Jul 10 15:04:37 NOMAD_CLIENT_1 nomad: 2019-07-10T15:04:37.919Z [INFO ] client: task exec session starting: exec_id=0c3684e1-d128-237f-770a-c2213644878c alloc_id=3cdea42e-c682-c292-06f4-db96c0d8ca2d task=rabbitmq command=[bash] tty=true access_token_name= access_token_id=
Jul 10 15:04:40 NOMAD_CLIENT_1 nomad: 2019-07-10T15:04:40.028Z [DEBUG] http: request complete: method=GET path=/v1/client/allocation/3cdea42e-c682-c292-06f4-db96c0d8ca2d/exec?command=%5B%22bash%22%5D&region=REGION_1&task=rabbitmq&tty=true duration=2.108800229s
However, the command nomad alloc exec fails on NOMAD_CLIENT_2 (where the alloc 3cdea42e is not running):
NOMAD_CLIENT_2 > nomad alloc exec 3cdea42e bash
failed to exec into task: write tcp NOMAD_SERVER_1:43064->NOMAD_SERVER_2:4647: write: connection reset by peer
Here are the logs:
NOMAD_CLIENT_2 > sudo /bin/cat /var/log/messages | grep 3cdea42e
Jul 10 14:30:42 NOMAD_CLIENT_2 nomad: 2019-07-10T14:30:42.138Z [ERROR] http: request failed: method=GET path=/v1/client/allocation/3cdea42e-c682-c292-06f4-db96c0d8ca2d/exec?command=%5B%22bash%22%5D&region=REGION_1&task=rabbitmq&tty=true error="write tcp NOMAD_SERVER_IP_1:57470->NOMAD_SERVER_IP_2:4647: write: connection reset by peer" code=500
Jul 10 14:30:42 NOMAD_CLIENT_2 nomad: 2019-07-10T14:30:42.138Z [DEBUG] http: request complete: method=GET path=/v1/client/allocation/3cdea42e-c682-c292-06f4-db96c0d8ca2d/exec?command=%5B%22bash%22%5D&region=REGION_1&task=rabbitmq&tty=true duration=16.958344ms
Of note, creating a websocket with
NOMAD_CLIENT_2 > curl -X GET \
--include \
--no-buffer \
--verbose \
--header "Connection: Upgrade" \
--header "Upgrade: websocket" \
--header "Sec-Websocket-Version: 13" \
--header "Sec-WebSocket-Key: foo" \
--cacert /etc/pki/cluster/nomad-ca.pem \
--cert /etc/pki/cluster/nomad.cli.pem \
--key /etc/pki/cluster/nomad.cli-key.pem \
"https://localhost:4646/v1/client/allocation/3cdea42e-c682-c292-06f4-db96c0d8ca2d/exec?command=%5B%22bash%22%5D&region=REGION_1&task=rabbitmq&tty=true"
succeeds, after which the nomad alloc exec commands will succeed until the connection is terminated.
FWIW, we're seeing the same thing after upgrading to 0.9.3 from 0.9.1.
Thank you so much for raising the issue. I pinpointed the problem: a bug in server-to-server streaming RPC forwarding when TLS is enabled. Aiming to have a PR fix soon; it should target 0.9.4.
Awesome!
You guys do great work and are incredibly responsive. Thanks much!
Rod
Glad to hear that this is a bug, and that you were able to pinpoint the bug.
I'm eager to see what the fix is :)
Thank you for the responsiveness; looking forward to upgrading to 0.9.4!
Cheers
Tested with a dev binary, and it indeed solved our issue. Thank you :)