Hey everyone,
I'll preface this by saying that the issue is intermittent: sometimes nomad alloc logs returns the logs, and sometimes the call hangs. It happens on running, failed, and completed allocations alike.
When calling nomad alloc logs on an allocation, the call hangs. The same happens in the Web UI, where both the direct call to the node and the call through the master fail.
One thing to note is that logs are available on the filesystem of the clients and can be read without any problems.
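As a sketch of what reading them from the filesystem looks like, the path below assumes Nomad's default on-disk layout under data_dir (which is /tmp/nomad in the config further down); the alloc ID and task name are illustrative placeholders taken from later in the thread:

```shell
# Reading task logs straight off a client's disk (assumes the default
# data_dir layout; alloc ID and task name are placeholders).
DATA_DIR=/tmp/nomad
ALLOC_ID=2ba31be3-db99-d243-7aea-27f41651b010
TASK=myservice
LOG_FILE="${DATA_DIR}/alloc/${ALLOC_ID}/alloc/logs/${TASK}.stdout.0"
echo "$LOG_FILE"
# then: cat "$LOG_FILE"
```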
I'm not sure whether this is a question or a bug report, as I can't reproduce the issue on other versions of Nomad (0.9.1).
$ nomad alloc logs -job myjob
It doesn't seem dependent on the job, since it sometimes works for one allocation of a job and not for another.
consul {
  # The address to the Consul agent.
  address = "127.0.0.1:8500"
  # The service name to register the server and client with Consul.
  server_service_name = "nomad"
  client_service_name = "nomadclient"
  # Enables automatically registering the services.
  auto_advertise = true
  # Enabling the server and client to bootstrap using Consul.
  server_auto_join = true
  client_auto_join = true
}

tls {
  http = true
  rpc = true
  ca_file = "/cert/cert.crt"
  cert_file = "/etc/ssl/server.eu-central-1.nomad/cert.pem"
  key_file = "/etc/ssl/server.eu-central-1.nomad/key.pem"
  verify_server_hostname = true
}

telemetry {
  publish_allocation_metrics = true
  publish_node_metrics = true
  datadog_address = "127.0.0.1:8125"
  disable_hostname = true
}

enable_syslog = true

acl {
  enabled = true
  token_ttl = "30s"
  policy_ttl = "60s"
}
datacenter = "datacenter"
data_dir = "/tmp/nomad"
region = "eu-central-1"
server {
  enabled = true
  bootstrap_expect = 3
  encrypt = "xxxxxxxx"
  job_gc_threshold = "12h"
  eval_gc_threshold = "6h"
  deployment_gc_threshold = "6h"
  authoritative_region = "eu-central-1"
}

vault {
  enabled = true
  address = "https://vault.service.consul"
  token = "XXXXXX"
  create_from_role = "system_nomad"
}
datacenter = "datacenter"
data_dir = "/tmp/nomad"
region = "eu-central-1"
client {
  enabled = true
  meta = {
  }
  options = {
    "docker.auth.config" = "/root/.docker/config.json"
    "docker.caps.whitelist" = "CHOWN,DAC_OVERRIDE,FSETID,FOWNER,MKNOD,NET_RAW,SETGID,SETUID,SETFCAP,SETPCAP,NET_BIND_SERVICE,SYS_CHROOT,KILL,AUDIT_WRITE,NET_ADMIN"
  }
  gc_interval = "10m"
  gc_max_allocs = 100
}

advertise {
}

vault {
  enabled = true
  address = "https://vault.service.consul"
}
Couldn't find anything meaningful in Nomad logs.
In the browser I see:
nomad-ui-c53fd732837d85cbd816ec376e86c4f7.js:1164 LOG FETCH: Couldn't connect to //10.102.3.22:4646/v1/client/fs/logs/2ba31be3-db99-d243-7aea-27f41651b010?follow=true&offset=50000&origin=end&task=myservice&type=stdout
(anonymous) @ nomad-ui-c53fd732837d85cbd816ec376e86c4f7.js:1164
nomad-ui-c53fd732837d85cbd816ec376e86c4f7.js:1164 LOG FETCH: Couldn't connect to /v1/client/fs/logs/2ba31be3-db99-d243-7aea-27f41651b010?follow=true&offset=50000&origin=end&task=myservice&type=stdout
The first error is expected: the client nodes don't have certificates that correspond to their IPs.
The second call, made directly to the master (which then proxies to the client), is the one that is flaky for us. Sometimes refreshing is enough to get the logs (in the UI), as is making another nomad logs call; other times, the call just hangs (through curl or nomad logs).
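For reference, the flaky call can be exercised outside the UI with a plain HTTP request to the same endpoint. This is only a sketch: nomad-server.example.com is a placeholder for one of the servers, and the alloc ID and task name are the ones from the UI errors above. Bounding the request with a timeout makes a hang show up as a failure instead of blocking forever:

```shell
# Hypothetical reproduction: hit the logs endpoint through a server with a
# bounded timeout, so a hang surfaces as a timeout rather than blocking.
# nomad-server.example.com is a placeholder address.
out=$(curl --silent --max-time 10 --insecure \
  "https://nomad-server.example.com:4646/v1/client/fs/logs/2ba31be3-db99-d243-7aea-27f41651b010?task=myservice&type=stdout&origin=end&offset=50000" \
  || echo "request timed out or failed")
echo "$out"
```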
I've been trying to figure it out for a few days at this point and can't find anything that could indicate what's happening.
Thanks
What happens if you run the CLI command on the client node where the allocation was scheduled (i.e., where the logs are still present on the file system)?
$ nomad alloc logs -job myjob
We experience the same behavior; the logs are working intermittently.
Hey @cgbaker
From the node, I was able to consistently get the logs:
$ NOMAD_ADDR=https://localhost:4646 NOMAD_SKIP_VERIFY=1 nomad logs -stderr $myAlloc
From my machine, pointing NOMAD_ADDR at the masters, Nomad was still only displaying the logs intermittently (roughly once every 5 to 10 tries).
We have similar problems with both nomad alloc logs and nomad alloc exec. Typically they work fine on the node with the allocation and do not work from other nodes. We didn't see this behavior in Nomad 0.9.0, but we started seeing it in 0.9.2 and it persists in 0.9.3. Preetha asked for log/conf/job files, but it looks like you beat me to getting them transferred off of our private network and out to the internet at large. Thanks for the thorough documentation and bug report.
Here is the example that @onomojaku referenced:
The alloc 3cdea42e is running on NOMAD_CLIENT_1. The command nomad alloc exec succeeds on NOMAD_CLIENT_1:
NOMAD_CLIENT_1 > nomad alloc exec 3cdea42e bash
root@3cdea42e:/#
Here are the logs:
NOMAD_CLIENT_1 > sudo /bin/cat /var/log/messages | grep 3cdea42e
Jul 10 15:04:37 NOMAD_CLIENT_1 nomad: 2019-07-10T15:04:37.919Z [INFO ] client: task exec session starting: exec_id=0c3684e1-d128-237f-770a-c2213644878c alloc_id=3cdea42e-c682-c292-06f4-db96c0d8ca2d task=rabbitmq command=[bash] tty=true access_token_name= access_token_id=
Jul 10 15:04:40 NOMAD_CLIENT_1 nomad: 2019-07-10T15:04:40.028Z [DEBUG] http: request complete: method=GET path=/v1/client/allocation/3cdea42e-c682-c292-06f4-db96c0d8ca2d/exec?command=%5B%22bash%22%5D&region=REGION_1&task=rabbitmq&tty=true duration=2.108800229s
However, the command nomad alloc exec fails on NOMAD_CLIENT_2 (where the alloc 3cdea42e is not running):
NOMAD_CLIENT_2 > nomad alloc exec 3cdea42e bash
failed to exec into task: write tcp NOMAD_SERVER_1:43064->NOMAD_SERVER_2:4647: write: connection reset by peer
Here are the logs:
NOMAD_CLIENT_2 > sudo /bin/cat /var/log/messages | grep 3cdea42e
Jul 10 14:30:42 NOMAD_CLIENT_2 nomad: 2019-07-10T14:30:42.138Z [ERROR] http: request failed: method=GET path=/v1/client/allocation/3cdea42e-c682-c292-06f4-db96c0d8ca2d/exec?command=%5B%22bash%22%5D&region=REGION_1&task=rabbitmq&tty=true error="write tcp NOMAD_SERVER_IP_1:57470->NOMAD_SERVER_IP_2:4647: write: connection reset by peer" code=500
Jul 10 14:30:42 NOMAD_CLIENT_2 nomad: 2019-07-10T14:30:42.138Z [DEBUG] http: request complete: method=GET path=/v1/client/allocation/3cdea42e-c682-c292-06f4-db96c0d8ca2d/exec?command=%5B%22bash%22%5D&region=REGION_1&task=rabbitmq&tty=true duration=16.958344ms
Of note, creating a websocket with
NOMAD_CLIENT_2 > curl -X GET \
--include \
--no-buffer \
--verbose \
--header "Connection: Upgrade" \
--header "Upgrade: websocket" \
--header "Sec-Websocket-Version: 13" \
--header "Sec-WebSocket-Key: foo" \
--cacert /etc/pki/cluster/nomad-ca.pem \
--cert /etc/pki/cluster/nomad.cli.pem \
--key /etc/pki/cluster/nomad.cli-key.pem \
"https://localhost:4646/v1/client/allocation/3cdea42e-c682-c292-06f4-db96c0d8ca2d/exec?command=%5B%22bash%22%5D&region=REGION_1&task=rabbitmq&tty=true"
succeeds, after which the nomad alloc exec commands will succeed until the connection is terminated.
FWIW, we're seeing the same thing after upgrading to 0.9.3 from 0.9.1.
Thank you so much for raising the issue. I pinpointed the problem: a bug in server-to-server streaming RPC forwarding when TLS is enabled. Aiming to have a PR fix soon; it should target 0.9.4.
Awesome!
You guys do great work and are incredibly responsive. Thanks much!
Rod
Glad to hear that this is a bug, and that you were able to pinpoint the bug.
I'm eager to see what the fix is :)
Thank you for the responsiveness; looking forward to upgrading to 0.9.4!
Cheers
Tested with a dev binary, and it indeed solved our issue. Thank you :)