Nomad stops streaming Docker stderr/stdout logs to their respective files

Created on 3 Apr 2020 · 4 comments · Source: hashicorp/nomad

Nomad version

Found on the 0.9.x series of releases. Currently unknown whether it reproduces on 0.10.x releases; we're attempting to determine whether we see it in our 0.10.x test cluster.

/ # nomad version
Nomad v0.9.7 (0e0eb07c53f99f54bcdb2e69aa8a9690a0597e7a)

Operating system and Environment details

  • OS Release: CentOS Linux release 7.7.1908 (Core)
  • Kernel: 3.10.0-1062.18.1.el7.x86_64
  • Colocated bare metal fleet

Issue

What we're seeing is that after a Docker task has been running for a while (hours to days), Nomad stops streaming its stdout/stderr to the respective log files. This appears to be a Nomad issue: running docker logs against the affected containers streams output correctly, so the log collection seems to be failing somewhere in Nomad. It doesn't appear to be related to log rotation, as we've seen it happen while the $task.std{out,err}.0 log file was still under the size limit for rotation.
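A quick way to confirm the stall on a client node is to sample the Nomad-managed log file's size over an interval while the container is known to still be producing output. This is a sketch, assuming a POSIX shell on the client; the paths and names in the comments are illustrative, not from the report:

```shell
# file_grew samples a file's size twice, $2 seconds apart, and succeeds
# only if the file grew in between.
file_grew() {
    before=$(wc -c < "$1")
    sleep "$2"
    after=$(wc -c < "$1")
    [ "$after" -gt "$before" ]
}

# On an affected client, something like:
#   file_grew /var/lib/nomad/alloc/<alloc-id>/alloc/logs/<task>.stdout.0 60 \
#       || echo "nomad log file stalled"
# while `docker logs -f <container>` on the same task still shows new lines.
```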

This was originally reported by one of our teams a few months back. At the time it seemed to affect only one of their jobs, and we had an alternative method of shipping their logs, so we didn't dig too hard into it. We've now become aware of other jobs being affected, so we're starting to take a closer look.

We're still investigating, but wanted to at least get the issue submitted; we'll provide more information as we learn more.

Reproduction steps

  1. Submit a Nomad job that writes to stdout/stderr
  2. Let it run for a bit (order of magnitude is hours to days)
  3. Eventually Nomad will stop writing Docker stdout/stderr to $task.std{out,err}.X log files
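The steps above can be sketched with a container command like the following (our own illustrative choice, not from the report). A timestamped heartbeat on both streams makes the moment the $task.std{out,err}.X files stall easy to pin down; in a real job you'd run it with a very large count or an infinite loop, the bounded form here is just for demonstration:

```shell
# Emit one timestamped line per second on stdout and one on stderr,
# $1 times. Each line carries its own wall-clock time, so comparing the
# last line in the Nomad log file against `docker logs` shows exactly
# when collection stopped.
heartbeat() {
    i=0
    while [ "$i" -lt "$1" ]; do
        date '+%F %T stdout heartbeat'
        date '+%F %T stderr heartbeat' >&2
        i=$((i + 1))
        sleep 1
    done
}
```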

Job file (if appropriate)

{
    "Job": {
        "Datacenters": [
            "a-dc"
        ],
        "ID": "a-nomad-job",
        "Migrate": null,
        "Name": "job-name",
        "Namespace": "default",
        "ParameterizedJob": null,
        "ParentID": "",
        "Payload": null,
        "Periodic": null,
        "Priority": 50,
        "Region": "a-region",
        "Reschedule": null,
        "Spreads": [
            {
                "Attribute": "${node.unique.id}",
                "SpreadTarget": null,
                "Weight": 50
            }
        ],
        "Stable": true,
        "Status": "running",
        "StatusDescription": "",
        "Stop": false,
        "TaskGroups": [
            {
                "Affinities": null,
                "Constraints": [
                    {
                        "LTarget": "${attr.vault.version}",
                        "Operand": "version",
                        "RTarget": ">= 0.6.1"
                    },
                    {
                        "LTarget": "${attr.os.signals}",
                        "Operand": "set_contains",
                        "RTarget": "SIGINT"
                    }
                ],
                "Count": 1,
                "EphemeralDisk": {
                    "Migrate": false,
                    "SizeMB": 300,
                    "Sticky": false
                },
                "Meta": null,
                "Migrate": {
                    "HealthCheck": "checks",
                    "HealthyDeadline": 300000000000,
                    "MaxParallel": 1,
                    "MinHealthyTime": 10000000000
                },
                "Name": "job-name",
                "ReschedulePolicy": {
                    "Attempts": 0,
                    "Delay": 30000000000,
                    "DelayFunction": "exponential",
                    "Interval": 0,
                    "MaxDelay": 3600000000000,
                    "Unlimited": true
                },
                "RestartPolicy": {
                    "Attempts": 2,
                    "Delay": 15000000000,
                    "Interval": 1800000000000,
                    "Mode": "fail"
                },
                "Spreads": null,
                "Tasks": [
                    {
                        "Affinities": null,
                        "Artifacts": null,
                        "Config": {
                            "volumes": [
                                "/var/lib/nomad/alloc/${NOMAD_ALLOC_ID}/${NOMAD_ALLOC_DIR}/logs:/opt/docker/logs"
                            ],
                            "extra_hosts": [
                                "telegraf:${attr.unique.network.ip-address}"
                            ],
                            "force_pull": true,
                            "image": "a-docker-image:latest"
                        },
                        "Constraints": null,
                        "DispatchPayload": null,
                        "Driver": "docker",
                        "Env": null,
                        "KillSignal": "",
                        "KillTimeout": 5000000000,
                        "Leader": false,
                        "LogConfig": {
                            "MaxFileSizeMB": 10,
                            "MaxFiles": 10
                        },
                        "Meta": null,
                        "Name": "a-nomad-job-main",
                        "Resources": {
                            "CPU": 240,
                            "Devices": null,
                            "DiskMB": 0,
                            "IOPS": 0,
                            "MemoryMB": 128,
                            "Networks": [
                                {
                                    "CIDR": "",
                                    "Device": "",
                                    "DynamicPorts": [
                                        {
                                            "Label": "http",
                                            "Value": 0
                                        }
                                    ],
                                    "IP": "",
                                    "MBits": 10,
                                    "ReservedPorts": null
                                }
                            ]
                        },
                        "Services": [
                            {
                                "AddressMode": "auto",
                                "CanaryTags": null,
                                "CheckRestart": null,
                                "Checks": null,
                                "Id": "",
                                "Name": "a-service",
                                "PortLabel": "http",
                                "Tags": null
                            }
                        ],
                        "ShutdownDelay": 0,
                        "Templates": [
                            {
                                "ChangeMode": "restart",
                                "ChangeSignal": "SIGHUP",
                                "DestPath": "/secrets/shared_secrets.env",
                                "EmbeddedTmpl": "REDACTED",
                                "Envvars": true,
                                "LeftDelim": "{{",
                                "Perms": "0600",
                                "RightDelim": "}}",
                                "SourcePath": "",
                                "Splay": 5000000000,
                                "VaultGrace": 15000000000
                            },
                            {
                                "ChangeMode": "restart",
                                "ChangeSignal": "SIGHUP",
                                "DestPath": "/local/enc_environment.env",
                                "EmbeddedTmpl": "REDACTED",
                                "Envvars": true,
                                "LeftDelim": "{{",
                                "Perms": "0644",
                                "RightDelim": "}}",
                                "SourcePath": "",
                                "Splay": 5000000000,
                                "VaultGrace": 15000000000
                            }
                        ],
                        "User": "",
                        "Vault": {
                            "ChangeMode": "restart",
                            "ChangeSignal": "SIGHUP",
                            "Env": false,
                            "Policies": [
                                "REDACTED",
                                "REDACTED",
                                "REDACTED"
                            ]
                        }
                    }
                ],
                "Update": {
                    "AutoPromote": false,
                    "AutoRevert": true,
                    "Canary": 0,
                    "HealthCheck": "checks",
                    "HealthyDeadline": 120000000000,
                    "MaxParallel": 1,
                    "MinHealthyTime": 30000000000,
                    "ProgressDeadline": 600000000000,
                    "Stagger": 30000000000
                }
            }
        ],
        "Type": "service",
        "Update": {
            "AutoPromote": false,
            "AutoRevert": false,
            "Canary": 0,
            "HealthCheck": "",
            "HealthyDeadline": 0,
            "MaxParallel": 1,
            "MinHealthyTime": 0,
            "ProgressDeadline": 0,
            "Stagger": 30000000000
        },
        "VaultToken": "",
        "Version": 29
    }
}

Nomad Client logs (if appropriate)

Will grab these when we see it reproduce again, or once I'm able to find more instances currently hidden in the fleet.

Labels: theme/logging, type/bug

All 4 comments

Hi @dpn! When this happens are both the log shipping processes still running or have they crashed? This will be nomad logmon and nomad docker_logger for each running job.
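A rough way to answer this from a client node is to count the log-shipping helper processes in a process listing. This is a sketch: the process names come from the comment above, and everything else (the sample listing, the expected one-per-task pairing) is an assumption:

```shell
# Count Nomad's per-task log-shipping helpers in a process listing.
# On a healthy client you'd expect one logmon and one docker_logger per
# running Docker task; fewer pairs than tasks suggests a shipper died.
count_log_shippers() {
    # $1: output of something like `ps -eo args`;
    # prints "<logmon-count> <docker_logger-count>"
    logmon=$(printf '%s\n' "$1" | grep -c 'nomad logmon')
    dlogger=$(printf '%s\n' "$1" | grep -c 'nomad docker_logger')
    printf '%s %s\n' "$logmon" "$dlogger"
}

# On a live client:
#   count_log_shippers "$(ps -eo args)"
```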

Thanks for the response @tgross. I'm trying to find the time to throw some scripts together to track down containers in this state and see whether this reproduces on 0.10.x in our setup; once that's done I can use those to answer that question. Will follow up with what I find.

I seem to be seeing this behaviour as well. I'm not sure how to correlate logmon and docker_logger to the respective containers but I have a whole heap of them running. And this is happening on a freshly started container.

Docker 19.03.5
Nomad v0.11.2

same problem here
Docker v19.03.12
Nomad v0.12.0
