Nomad incorrectly marking unhealthy allocs as healthy during rolling upgrade

Created on 11 Mar 2020 · 8 Comments · Source: hashicorp/nomad

Nomad version

Found on

$ nomad version
Nomad v0.9.6 (1f8eddf2211d064b150f141c86e30d9fceabec89)

Also repros on these versions in our test clusters:

$ nomad version
Nomad v0.9.7 (0e0eb07c53f99f54bcdb2e69aa8a9690a0597e7a)
$ nomad version
Nomad v0.10.4 (f750636ca68e17dcd2445c1ab9c5a34f9ac69345)



Job file



{
    "Job": {
        "ID": "test-nomad-rolling-deployments",
        "Name": "test-nomad-rolling-deployments",
        "Type": "service",
        "Priority": 50,
        "Region": "a-region",
        "DataCenters": [
            "a-dc"
        ],
        "TaskGroups": [
            {
                "Count": 3,
                "Tasks": [
                    {
                        "Driver": "docker",
                        "Resources": {
                            "CPU": 48,
                            "MemoryMB": 32,
                            "Networks": [
                                {
                                    "DynamicPorts": [
                                        {
                                            "label": "main_port"
                                        }
                                    ]
                                }
                            ]
                        },
                        "Services": [
                            {
                                "PortLabel": "main_port",
                                "Checks": [
                                    {
                                        "Type": "http",
                                        "Interval": 10000000000,
                                        "Timeout": 5000000000,
                                        "CheckRestart": {
                                            "Limit": 3,
                                            "Grace": 30000000000
                                        },
                                        "Path": "/healthcheck-ok",
                                        "Name": "healthcheck"
                                    }
                                ],
                                "Name": "test-rolling-restart-service-main"
                            }
                        ],
                        "ShutdownDelay": 5000000000,
                        "Templates": [
                            {
                                "ChangeSignal": "SIGHUP",
                                "DestPath": "local/nginx.conf",
                                "Perms": "0644",
                                "ChangeMode": "signal",
                                "EmbeddedTmpl": "events {}\n\nhttp {\n  server {\n    location /healthcheck-ok {\n      return 200 'OK';\n      add_header Content-Type text/plain;\n    }\n\n    location /healthcheck-not-ok {\n      return 500 'NOT OK';\n      add_header Content-Type text/plain;\n    }\n  }\n}\n"
                            }
                        ],
                        "Config": {
                            "force_pull": true,
                            "image": "nginx:latest",
                            "port_map": [
                                {
                                    "main_port": 80
                                }
                            ],
                            "volumes": [
                                "local:/etc/nginx"
                            ]
                        },
                        "KillTimeout": 15000000000,
                        "Name": "main"
                    },
                    {
                        "Driver": "docker",
                        "Resources": {
                            "CPU": 48,
                            "MemoryMB": 32,
                            "Networks": [
                                {
                                    "DynamicPorts": [
                                        {
                                            "label": "secondary_port"
                                        }
                                    ]
                                }
                            ]
                        },
                        "Services": [
                            {
                                "PortLabel": "secondary_port",
                                "Checks": [
                                    {
                                        "Type": "http",
                                        "Interval": 10000000000,
                                        "Timeout": 5000000000,
                                        "CheckRestart": {
                                            "Limit": 3,
                                            "Grace": 180000000000
                                        },
                                        "Path": "/healthcheck-ok",
                                        "Name": "healthcheck"
                                    }
                                ],
                                "Name": "test-rolling-restart-service-secondary"
                            }
                        ],
                        "ShutdownDelay": 5000000000,
                        "Templates": [
                            {
                                "ChangeSignal": "SIGHUP",
                                "DestPath": "local/nginx.conf",
                                "Perms": "0644",
                                "ChangeMode": "signal",
                                "EmbeddedTmpl": "events {}\n\nhttp {\n  server {\n    location /healthcheck-ok {\n      return 200 'OK';\n      add_header Content-Type text/plain;\n    }\n\n    location /healthcheck-not-ok {\n      return 500 'NOT OK';\n      add_header Content-Type text/plain;\n    }\n  }\n}\n"
                            }
                        ],
                        "Config": {
                            "force_pull": true,
                            "image": "nginx:latest",
                            "port_map": [
                                {
                                    "secondary_port": 80
                                }
                            ],
                            "volumes": [
                                "local:/etc/nginx"
                            ]
                        },
                        "KillTimeout": 15000000000,
                        "Name": "secondary"
                    }
                ],
                "RestartPolicy": {
                    "Attempts": 3,
                    "Delay": 15000000000,
                    "Interval": 180000000000,
                    "Mode": "fail"
                },
                "Update": {
                    "MaxParallel": 1,
                    "AutoRevert": true,
                    "HealthCheck": "checks",
                    "ProgressDeadline": 960000000000,
                    "HealthyDeadline": 900000000000,
                    "MinHealthyTime": 10000000000,
                    "Stagger": 10000000000
                },
                "Name": "test-nomad-rolling-deployments"
            }
        ]
    }
}

I've left off other logs as I think the repro steps are sufficient and this reproduces 100% of the time in our setup, but happy to gather some if necessary.
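
For example, since the embedded nginx config serves both a passing endpoint (/healthcheck-ok, returns 200) and a failing one (/healthcheck-not-ok, returns 500), a check can be driven unhealthy mid-deployment by re-submitting the job with the check pointed at the failing path. A sketch of one such check (one way to trigger the failure, not necessarily the exact steps used; values as in the job above):

    "Checks": [
        {
            "Type": "http",
            "Interval": 10000000000,
            "Timeout": 5000000000,
            "Path": "/healthcheck-not-ok",
            "Name": "healthcheck"
        }
    ]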

Labels: theme/deployments, type/bug

All 8 comments

We have also seen this in our testing with v0.10.4. We don't fully understand the situation, but it's definitely a problem: currently, no deployment will ever fail due to health checks.

We have a job whose task will never get healthy, yet for some reason, Nomad always passes the deployment. Consul properly reports the task as unhealthy.

Another big problem related to this is that the restart only restarts successfully once. We've replicated this behavior a few times now. The check_restart does its job by triggering the _first_ restart, and then nothing. It seems that once an allocation has been restarted, check_restart is unable to trigger the restart policy again.

We can mitigate this by setting the restart policy's attempts to 0 with mode: fail, which always forces a reschedule on the first failure. However, this points to another problematic scenario we cannot confirm: if a healthy task suddenly goes unhealthy, does the check_restart policy go into effect?

During testing, we've set the task's check's check_restart to grace:0, limit:3, with a check interval of 10s. This successfully results in the first restart after 20s (the first check appears to run immediately, followed by two more at 10s intervals).
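
For reference, a minimal check fragment matching that timing might look like the following (durations in nanoseconds; illustrative values, not our exact job):

    "Checks": [
        {
            "Type": "http",
            "Interval": 10000000000,
            "Timeout": 5000000000,
            "Path": "/healthcheck",
            "Name": "healthcheck",
            "CheckRestart": {
                "Grace": 0,
                "Limit": 3
            }
        }
    ]

With a 10s Interval and Limit 3, the first check fires immediately and the third consecutive failure lands about 20s later, which lines up with the restart timing described above.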

Solid report @dpn, this regression seems pretty dire.

Thank you @kainoaseto and @tydomitrovich for finding and helping to troubleshoot.

Hi guys, apologies if I'm inflating the priority for this issue, but it seems pretty serious that we cannot depend on health checks of allocations during deployments.

Could we get confirmation that this issue has been acknowledged and is being prioritized (hopefully on the higher side)?

@tgross @drewbailey @dadgar ?

@dpn @djenriquez This seems very bad indeed. I'll be investigating this now and will post updates when I get an understanding of the underlying issue and of any mitigating factors. Thank you very much for the detailed and clear reproduction steps.

Thanks again for the issue. It's indeed very serious: it affects virtually all deployments, on Nomad versions as old as 0.8.0 and, I believe, earlier.

It affects deployments where min_healthy_time is less than the restart delay: while the task is being restarted, the Nomad client may consider it healthy!

One workaround is to increase min_healthy_time to be higher than possible restart delays.
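
To make that concrete with the values from the job file above: its RestartPolicy Delay is 15s while MinHealthyTime is only 10s, which is exactly the vulnerable combination. A mitigated fragment (illustrative values) keeps MinHealthyTime above the delay:

    "RestartPolicy": {
        "Delay": 15000000000
    },
    "Update": {
        "MinHealthyTime": 30000000000
    }

Here a 30s MinHealthyTime exceeds the 15s restart delay, so a task that is down for a restart cannot span the entire health window and still be counted healthy.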

I'm working on the fix and aim to have it ready later this week.

Thanks @notnoop, really appreciate you digging into this. Do you think this will be backported to the 0.9 and 0.10 series of releases? I know we're lagging behind by being on 0.9 but we'll be finishing up our 0.10 validation soon and plan to migrate over once that's complete.

Thank you @notnoop for looking into this, and for the workaround in the meantime! I will look at implementing it in our jobs on our 0.10 clusters to mitigate this bug, and will watch for the fix later this week.

Hi @notnoop and anyone else who runs into this before the fix is released in 0.11.0. I tested the mitigation by setting Restart.Delay to be < min_healthy_time, as suggested, and was able to:

  • have allocations fail during deployments from health checks failing
  • have allocations fail and reschedule from health checks failing

Thanks for the workaround!

Below is some sample configuration in case anyone else runs into the same thing:

All at the task group level:

    "ReschedulePolicy": {
        "Attempts": 0,
        "Delay": 15000000000,
        "DelayFunction": "exponential",
        "Interval": 0,
        "MaxDelay": 60000000000,
        "Unlimited": true
      },
      "RestartPolicy": {
        "Attempts": 0,
        "Delay": 15000000000,
        "Interval": 1800000000000,
        "Mode": "fail"
      },
      "Services": [
        {
          "AddressMode": "auto",
          "Checks": [
            {
              "AddressMode": "",
              "CheckRestart": {
                "Grace": 10000000000,
                "IgnoreWarnings": false,
                "Limit": 3
              },
              "Command": "",
              "GRPCService": "",
              "GRPCUseTLS": false,
              "InitialStatus": "warning",
              "Interval": 10000000000,
              "Method": "GET",
              "Name": "healthy",
              "Path": "/healthcheck",
              "PortLabel": "my-service",
              "Protocol": "",
              "TLSSkipVerify": false,
              "TaskName": "",
              "Timeout": 5000000000,
              "Type": "http"
            }
          ],
    .
    .
    .
    "Update": {
        "AutoPromote": false,
        "AutoRevert": true,
        "Canary": 0,
        "HealthCheck": "checks",
        "HealthyDeadline": 300000000000,
        "MaxParallel": 1,
        "MinHealthyTime": 200000000000,
        "ProgressDeadline": 600000000000,
        "Stagger": 30000000000
      },
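
Note the key relationship in this configuration: MinHealthyTime (200s) is well above the restart Delay (15s), which is exactly the constraint the workaround requires.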

Thanks @notnoop for the quick fix!
