Found on the 0.9.x series of releases. Currently unknown whether it reproduces on 0.10.x releases; we're attempting to determine if we see reproduction in our 0.10.x test cluster.
/ # nomad version
Nomad v0.9.7 (0e0eb07c53f99f54bcdb2e69aa8a9690a0597e7a)
CentOS Linux release 7.7.1908 (Core)
3.10.0-1062.18.1.el7.x86_64

What we're seeing is that after a Docker task has been running for a while (hours to days), Nomad stops streaming the container's stdout/stderr to their respective log files. This appears to be a Nomad issue: running docker logs against the affected containers still streams output properly, so it seems the log collection itself is failing somehow. It doesn't appear to be related to log rotation either, as we've seen it happen while the $task.std{out,err}.0 log file was still under the size limit for rotation.
This was originally reported by one of our teams a few months back. At the time it seemed to affect only one of their jobs, and since we had an alternative method of shipping their logs we didn't dig too hard into it. We've now become aware of other jobs being affected, so we're starting to take a closer look.
We're still investigating, but we wanted to at least get the issue submitted; we'll provide more information as we learn more.
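Roughly the kind of spot-check described above, for a single suspect allocation. This is a sketch only: the alloc ID and task name below are placeholders, and it assumes the default /var/lib/nomad data_dir layout plus the docker driver's usual <task name>-<alloc id> container naming.

ALLOC_ID="00000000-0000-0000-0000-000000000000"   # placeholder: take from `nomad job status`
TASK="a-nomad-job-main"

# When did Nomad last write to the captured log files?
ls -l --time-style=full-iso /var/lib/nomad/alloc/$ALLOC_ID/alloc/logs/$TASK.std*

# Compare against what Docker itself still has for the same task:
docker logs --tail 5 --timestamps "$TASK-$ALLOC_ID"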
Expected: Docker stdout/stderr streamed to the $task.std{out,err}.X log files.

Job file (redacted):

{
"Job": {
"Datacenters": [
"a-dc"
],
"ID": "a-nomad-job",
"Migrate": null,
"Name": "job-name",
"Namespace": "default",
"ParameterizedJob": null,
"ParentID": "",
"Payload": null,
"Periodic": null,
"Priority": 50,
"Region": "a-region",
"Reschedule": null,
"Spreads": [
{
"Attribute": "${node.unique.id}",
"SpreadTarget": null,
"Weight": 50
}
],
"Stable": true,
"Status": "running",
"StatusDescription": "",
"Stop": false,
"TaskGroups": [
{
"Affinities": null,
"Constraints": [
{
"LTarget": "${attr.vault.version}",
"Operand": "version",
"RTarget": ">= 0.6.1"
},
{
"LTarget": "${attr.os.signals}",
"Operand": "set_contains",
"RTarget": "SIGINT"
}
],
"Count": 1,
"EphemeralDisk": {
"Migrate": false,
"SizeMB": 300,
"Sticky": false
},
"Meta": null,
"Migrate": {
"HealthCheck": "checks",
"HealthyDeadline": 300000000000,
"MaxParallel": 1,
"MinHealthyTime": 10000000000
},
"Name": "job-name",
"ReschedulePolicy": {
"Attempts": 0,
"Delay": 30000000000,
"DelayFunction": "exponential",
"Interval": 0,
"MaxDelay": 3600000000000,
"Unlimited": true
},
"RestartPolicy": {
"Attempts": 2,
"Delay": 15000000000,
"Interval": 1800000000000,
"Mode": "fail"
},
"Spreads": null,
"Tasks": [
{
"Affinities": null,
"Artifacts": null,
"Config": {
"volumes": [
"/var/lib/nomad/alloc/${NOMAD_ALLOC_ID}/${NOMAD_ALLOC_DIR}/logs:/opt/docker/logs"
],
"extra_hosts": [
"telegraf:${attr.unique.network.ip-address}"
],
"force_pull": true,
"image": "a-docker-image:latest",
},
"Constraints": null,
"DispatchPayload": null,
"Driver": "docker",
"Env": null,
"KillSignal": "",
"KillTimeout": 5000000000,
"Leader": false,
"LogConfig": {
"MaxFileSizeMB": 10,
"MaxFiles": 10
},
"Meta": null,
"Name": "a-nomad-job-main",
"Resources": {
"CPU": 240,
"Devices": null,
"DiskMB": 0,
"IOPS": 0,
"MemoryMB": 128,
"Networks": [
{
"CIDR": "",
"Device": "",
"DynamicPorts": [
{
"Label": "http",
"Value": 0
}
],
"IP": "",
"MBits": 10,
"ReservedPorts": null
}
]
},
"Services": [
{
"AddressMode": "auto",
"CanaryTags": null,
"CheckRestart": null,
"Checks": null,
"Id": "",
"Name": "a-service",
"PortLabel": "http",
"Tags": null
}
],
"ShutdownDelay": 0,
"Templates": [
{
"ChangeMode": "restart",
"ChangeSignal": "SIGHUP",
"DestPath": "/secrets/shared_secrets.env",
"EmbeddedTmpl": "REDACTED",
"Envvars": true,
"LeftDelim": "{{",
"Perms": "0600",
"RightDelim": "}}",
"SourcePath": "",
"Splay": 5000000000,
"VaultGrace": 15000000000
},
{
"ChangeMode": "restart",
"ChangeSignal": "SIGHUP",
"DestPath": "/local/enc_environment.env",
"EmbeddedTmpl": "REDACTED",
"Envvars": true,
"LeftDelim": "{{",
"Perms": "0644",
"RightDelim": "}}",
"SourcePath": "",
"Splay": 5000000000,
"VaultGrace": 15000000000
}
],
"User": "",
"Vault": {
"ChangeMode": "restart",
"ChangeSignal": "SIGHUP",
"Env": false,
"Policies": [
"REDACTED",
"REDACTED",
"REDACTED"
]
}
}
],
"Update": {
"AutoPromote": false,
"AutoRevert": true,
"Canary": 0,
"HealthCheck": "checks",
"HealthyDeadline": 120000000000,
"MaxParallel": 1,
"MinHealthyTime": 30000000000,
"ProgressDeadline": 600000000000,
"Stagger": 30000000000
}
}
],
"Type": "service",
"Update": {
"AutoPromote": false,
"AutoRevert": false,
"Canary": 0,
"HealthCheck": "",
"HealthyDeadline": 0,
"MaxParallel": 1,
"MinHealthyTime": 0,
"ProgressDeadline": 0,
"Stagger": 30000000000
},
"VaultToken": "",
"Version": 29
}
}
Will grab these when we see it reproduce again or I'm able to discover more instances currently hidden in the fleet.
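To hunt for more instances on a given client node, something along these lines should work. Again a sketch only: it assumes the default /var/lib/nomad data_dir, the docker driver's <task name>-<alloc id> container naming, and an arbitrary one-hour staleness threshold.

STALE_SECS=3600
now=$(date +%s)
for name in $(docker ps --format '{{.Names}}'); do
  # pull the trailing alloc UUID out of the container name
  alloc=$(echo "$name" | grep -oE '[0-9a-f-]{36}$') || continue
  logdir=/var/lib/nomad/alloc/$alloc/alloc/logs
  [ -d "$logdir" ] || continue
  # newest mtime across the captured stdout/stderr files
  newest=$(stat -c %Y "$logdir"/*.std* 2>/dev/null | sort -n | tail -1)
  [ -n "$newest" ] || continue
  # flag containers that Docker says are still logging but whose Nomad files have gone quiet
  if [ $((now - newest)) -gt "$STALE_SECS" ] && \
     [ -n "$(docker logs --since ${STALE_SECS}s --tail 1 "$name" 2>&1)" ]; then
    echo "possible stalled log collection: $name (alloc $alloc)"
  fi
done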
Hi @dpn! When this happens are both the log shipping processes still running or have they crashed? This will be nomad logmon and nomad docker_logger for each running job.
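For anyone checking the same thing on a client node, something like this should list those processes (a sketch; it just pattern-matches the process command lines):

pgrep -af 'nomad logmon'
pgrep -af 'nomad docker_logger'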
Thanks for the response @tgross. I'm trying to find the time to throw some scripts together to track down containers in this state and see whether this is reproducing on 0.10.x in our setup; once that's done I can leverage those to answer that question. Will follow up with what I find.
I seem to be seeing this behaviour as well. I'm not sure how to correlate logmon and docker_logger to the respective containers but I have a whole heap of them running. And this is happening on a freshly started container.
Docker 19.03.5
Nomad v0.11.2
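One way that may help with the correlation question above, as root on the client node. This is a sketch built on an assumption: that each logmon process holds its allocation's log files open, so the alloc ID shows up among the process's open file descriptors.

for pid in $(pgrep -f 'nomad logmon'); do
  echo "== logmon pid $pid"
  # look for alloc UUIDs in the paths of the process's open files
  ls -l /proc/$pid/fd 2>/dev/null | grep -o 'alloc/[0-9a-f-]\{36\}' | sort -u
done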
same problem here
Docker v19.03.12
Nomad v0.12.0