Nomad incorrectly marking unhealthy allocs as healthy during rolling upgrade

Created on 11 Mar 2020 · 8 Comments · Source: hashicorp/nomad

Nomad version

Found on

$ nomad version
Nomad v0.9.6 (1f8eddf2211d064b150f141c86e30d9fceabec89)

Also repros on these versions in our test clusters:

$ nomad version
Nomad v0.9.7 (0e0eb07c53f99f54bcdb2e69aa8a9690a0597e7a)
$ nomad version
Nomad v0.10.4 (f750636ca68e17dcd2445c1ab9c5a34f9ac69345)



Job file



{
    "Job": {
        "ID": "test-nomad-rolling-deployments",
        "Name": "test-nomad-rolling-deployments",
        "Type": "service",
        "Priority": 50,
        "Region": "a-region",
        "DataCenters": [
            "a-dc"
        ],
        "TaskGroups": [
            {
                "Count": 3,
                "Tasks": [
                    {
                        "Driver": "docker",
                        "Resources": {
                            "CPU": 48,
                            "MemoryMB": 32,
                            "Networks": [
                                {
                                    "DynamicPorts": [
                                        {
                                            "label": "main_port"
                                        }
                                    ]
                                }
                            ]
                        },
                        "Services": [
                            {
                                "PortLabel": "main_port",
                                "Checks": [
                                    {
                                        "Type": "http",
                                        "Interval": 10000000000,
                                        "Timeout": 5000000000,
                                        "CheckRestart": {
                                            "Limit": 3,
                                            "Grace": 30000000000
                                        },
                                        "Path": "/healthcheck-ok",
                                        "Name": "healthcheck"
                                    }
                                ],
                                "Name": "test-rolling-restart-service-main"
                            }
                        ],
                        "ShutdownDelay": 5000000000,
                        "Templates": [
                            {
                                "ChangeSignal": "SIGHUP",
                                "DestPath": "local/nginx.conf",
                                "Perms": "0644",
                                "ChangeMode": "signal",
                                "EmbeddedTmpl": "events {}\n\nhttp {\n  server {\n    location /healthcheck-ok {\n      return 200 'OK';\n      add_header Content-Type text/plain;\n    }\n\n    location /healthcheck-not-ok {\n      return 500 'NOT OK';\n      add_header Content-Type text/plain;\n    }\n  }\n}\n"
                            }
                        ],
                        "Config": {
                            "force_pull": true,
                            "image": "nginx:latest",
                            "port_map": [
                                {
                                    "main_port": 80
                                }
                            ],
                            "volumes": [
                                "local:/etc/nginx"
                            ]
                        },
                        "KillTimeout": 15000000000,
                        "Name": "main"
                    },
                    {
                        "Driver": "docker",
                        "Resources": {
                            "CPU": 48,
                            "MemoryMB": 32,
                            "Networks": [
                                {
                                    "DynamicPorts": [
                                        {
                                            "label": "secondary_port"
                                        }
                                    ]
                                }
                            ]
                        },
                        "Services": [
                            {
                                "PortLabel": "secondary_port",
                                "Checks": [
                                    {
                                        "Type": "http",
                                        "Interval": 10000000000,
                                        "Timeout": 5000000000,
                                        "CheckRestart": {
                                            "Limit": 3,
                                            "Grace": 180000000000
                                        },
                                        "Path": "/healthcheck-ok",
                                        "Name": "healthcheck"
                                    }
                                ],
                                "Name": "test-rolling-restart-service-secondary"
                            }
                        ],
                        "ShutdownDelay": 5000000000,
                        "Templates": [
                            {
                                "ChangeSignal": "SIGHUP",
                                "DestPath": "local/nginx.conf",
                                "Perms": "0644",
                                "ChangeMode": "signal",
                                "EmbeddedTmpl": "events {}\n\nhttp {\n  server {\n    location /healthcheck-ok {\n      return 200 'OK';\n      add_header Content-Type text/plain;\n    }\n\n    location /healthcheck-not-ok {\n      return 500 'NOT OK';\n      add_header Content-Type text/plain;\n    }\n  }\n}\n"
                            }
                        ],
                        "Config": {
                            "force_pull": true,
                            "image": "nginx:latest",
                            "port_map": [
                                {
                                    "secondary_port": 80
                                }
                            ],
                            "volumes": [
                                "local:/etc/nginx"
                            ]
                        },
                        "KillTimeout": 15000000000,
                        "Name": "secondary"
                    }
                ],
                "RestartPolicy": {
                    "Attempts": 3,
                    "Delay": 15000000000,
                    "Interval": 180000000000,
                    "Mode": "fail"
                },
                "Update": {
                    "MaxParallel": 1,
                    "AutoRevert": true,
                    "HealthCheck": "checks",
                    "ProgressDeadline": 960000000000,
                    "HealthyDeadline": 900000000000,
                    "MinHealthyTime": 10000000000,
                    "Stagger": 10000000000
                },
                "Name": "test-nomad-rolling-deployments"
            }
        ]
    }
}

I've left off other logs as I think the repro steps are sufficient and this reproduces 100% of the time in our setup, but happy to gather some if necessary.
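
For example, since the embedded nginx config serves both a passing endpoint (/healthcheck-ok, returns 200) and a failing one (/healthcheck-not-ok, returns 500), a check can be driven unhealthy mid-deployment by re-submitting the job with the check pointed at the failing path. A sketch of one such check (one way to trigger the failure, not necessarily the exact steps used; values as in the job above):

    "Checks": [
        {
            "Type": "http",
            "Interval": 10000000000,
            "Timeout": 5000000000,
            "Path": "/healthcheck-not-ok",
            "Name": "healthcheck"
        }
    ]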

Labels: theme/deployments, type/bug

All 8 comments

We have also seen this in our testing with v0.10.4. We don't fully understand the situation, but it's definitely a problem: currently, no deployment will ever fail due to health checks.

We have a job whose task will never get healthy, yet for some reason, Nomad always passes the deployment. Consul properly reports the task as unhealthy.

Another big problem related to this is that the restart only restarts successfully once. We've replicated this behavior a few times now. The check_restart does its job by triggering the _first_ restart, and then nothing. It seems that once an allocation has been restarted, check_restart is unable to trigger the restart policy again.

We can mitigate this by setting the restart policy's attempts to 0 with mode: fail, which always forces a reschedule on the first failure. However, this points to another problematic scenario we cannot confirm: if a healthy task suddenly goes unhealthy, does the check_restart policy go into effect?

During testing, we've set the task's check's check_restart to grace:0, limit:3, with a check interval of 10s. This successfully results in the first restart after 20s (the first check appears to run immediately, followed by two more at 10s intervals).
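
For reference, a minimal check fragment matching that timing might look like the following (durations in nanoseconds; illustrative values, not our exact job):

    "Checks": [
        {
            "Type": "http",
            "Interval": 10000000000,
            "Timeout": 5000000000,
            "Path": "/healthcheck",
            "Name": "healthcheck",
            "CheckRestart": {
                "Grace": 0,
                "Limit": 3
            }
        }
    ]

With a 10s Interval and Limit 3, the first check fires immediately and the third consecutive failure lands about 20s later, which lines up with the restart timing described above.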

Solid report @dpn, this regression seems pretty dire.

Thank you @kainoaseto and @tydomitrovich for finding and helping to troubleshoot.

Hi guys, apologies if I'm inflating the priority for this issue, but it seems pretty serious that we cannot depend on health checks of allocations during deployments.

Could we get confirmation that this issue has been acknowledged and is being prioritized (hopefully on the higher side)?

@tgross @drewbailey @dadgar ?

@dpn @djenriquez This seems very bad indeed. I'll be investigating this now and will post updates when I get an understanding of the underlying issue and of any mitigating factors. Thank you very much for the detailed and clear reproduction steps.

Thanks again for the issue. It's indeed very serious: it affects virtually all deployments, on Nomad versions as old as 0.8.0 and, I believe, earlier.

It affects deployments where min_healthy_time is less than the restart delay: while the task is being restarted, the Nomad client may consider it healthy!

One workaround is to increase min_healthy_time to be higher than possible restart delays.
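
To make that concrete with the values from the job file above: its RestartPolicy Delay is 15s while MinHealthyTime is only 10s, which is exactly the vulnerable combination. A mitigated fragment (illustrative values) keeps MinHealthyTime above the delay:

    "RestartPolicy": {
        "Delay": 15000000000
    },
    "Update": {
        "MinHealthyTime": 30000000000
    }

Here a 30s MinHealthyTime exceeds the 15s restart delay, so a task that is down for a restart cannot span the entire health window and still be counted healthy.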

I'm working on the fix and aim to have it ready later this week.

Thanks @notnoop, really appreciate you digging into this. Do you think this will be backported to the 0.9 and 0.10 series of releases? I know we're lagging behind by being on 0.9 but we'll be finishing up our 0.10 validation soon and plan to migrate over once that's complete.

Thank you @notnoop for looking into this, and for the workaround in the meantime! I will look at implementing it in our jobs on our 0.10 clusters to mitigate this bug, and will watch for the fix later this week.

Hi @notnoop and anyone else who runs into this before the fix is released in 0.11.0. I tested the mitigation by setting Restart.Delay to be < min_healthy_time, as suggested, and was able to:

  • have allocations fail during deployments from health checks failing
  • have allocations fail and reschedule from health checks failing

Thanks for the workaround!

Below is some sample configuration in case anyone else runs into the same thing:

All at the task group level:

    "ReschedulePolicy": {
        "Attempts": 0,
        "Delay": 15000000000,
        "DelayFunction": "exponential",
        "Interval": 0,
        "MaxDelay": 60000000000,
        "Unlimited": true
      },
      "RestartPolicy": {
        "Attempts": 0,
        "Delay": 15000000000,
        "Interval": 1800000000000,
        "Mode": "fail"
      },
      "Services": [
        {
          "AddressMode": "auto",
          "Checks": [
            {
              "AddressMode": "",
              "CheckRestart": {
                "Grace": 10000000000,
                "IgnoreWarnings": false,
                "Limit": 3
              },
              "Command": "",
              "GRPCService": "",
              "GRPCUseTLS": false,
              "InitialStatus": "warning",
              "Interval": 10000000000,
              "Method": "GET",
              "Name": "healthy",
              "Path": "/healthcheck",
              "PortLabel": "my-service",
              "Protocol": "",
              "TLSSkipVerify": false,
              "TaskName": "",
              "Timeout": 5000000000,
              "Type": "http"
            }
          ],
    .
    .
    .
    "Update": {
        "AutoPromote": false,
        "AutoRevert": true,
        "Canary": 0,
        "HealthCheck": "checks",
        "HealthyDeadline": 300000000000,
        "MaxParallel": 1,
        "MinHealthyTime": 200000000000,
        "ProgressDeadline": 600000000000,
        "Stagger": 30000000000
      },
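
Note the key relationship in this configuration: MinHealthyTime (200s) is well above the restart Delay (15s), which is exactly the constraint the workaround requires.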

Thanks @notnoop for the quick fix!
