Nomad: Nomad node-drain enable output ordering

Created on 19 Apr 2018 · 4 Comments · Source: hashicorp/nomad

Nomad version

Nomad v0.8.1 (46aa11ba851f114f5faa0876839b45fbca0c47f0)

Operating system and Environment details

lsb_release: Debian GNU/Linux 9.4 (stretch)
cloud_env: aws

Issue

When performing nomad node-drain -enable <node_ID>, the output is slightly misleading and mistimed: it logs that the node drain is complete before all allocations are logged as complete. I have experienced this on every node drain performed; see below for two example console outputs:

2018-04-19T08:01:58Z: Node "f913db8d-8af6-c5e1-2957-ed54cece201c" drain strategy set
2018-04-19T08:01:59Z: Alloc "5067a44a-e31c-0d03-1040-960f352ddcac" marked for migration
2018-04-19T08:01:59Z: Alloc "5067a44a-e31c-0d03-1040-960f352ddcac" draining
2018-04-19T08:01:59Z: Alloc "5067a44a-e31c-0d03-1040-960f352ddcac" status running -> complete
2018-04-19T08:02:09Z: Alloc "f5a363c6-acfb-072b-acc0-2f19d3dac28c" status running -> complete
2018-04-19T08:02:09Z: Alloc "630ca380-b257-2f44-08cd-d0b1c61139d4" marked for migration
2018-04-19T08:02:09Z: Node "f913db8d-8af6-c5e1-2957-ed54cece201c" drain complete
2018-04-19T08:02:09Z: Alloc "630ca380-b257-2f44-08cd-d0b1c61139d4" draining
2018-04-19T08:02:09Z: Alloc "630ca380-b257-2f44-08cd-d0b1c61139d4" status running -> complete
2018-04-19T08:02:09Z: All allocations on node "f913db8d-8af6-c5e1-2957-ed54cece201c" have stopped.
2018-04-19T08:12:19Z: Node "4f990645-c723-9f9f-5379-786f88f4f989" drain strategy set
2018-04-19T08:12:21Z: Alloc "794d5374-13f7-4859-9143-36a734fa64ba" marked for migration
2018-04-19T08:12:21Z: Alloc "1b2fc50c-610a-bff7-2033-966ed07c8ddd" marked for migration
2018-04-19T08:12:21Z: Alloc "e585f0fc-ee85-937c-1393-1e071e733d21" marked for migration
2018-04-19T08:12:21Z: Alloc "37ab49ab-42d9-0693-2c59-1db9090b4139" marked for migration
2018-04-19T08:12:21Z: Alloc "794d5374-13f7-4859-9143-36a734fa64ba" draining
2018-04-19T08:12:21Z: Alloc "e585f0fc-ee85-937c-1393-1e071e733d21" draining
2018-04-19T08:12:21Z: Alloc "1b2fc50c-610a-bff7-2033-966ed07c8ddd" draining
2018-04-19T08:12:21Z: Alloc "37ab49ab-42d9-0693-2c59-1db9090b4139" draining
2018-04-19T08:12:22Z: Alloc "794d5374-13f7-4859-9143-36a734fa64ba" status running -> complete
2018-04-19T08:12:22Z: Alloc "4ffff6cf-40e7-3013-436a-1c10654e4507" marked for migration
2018-04-19T08:12:22Z: Alloc "f277e614-fa8d-dfeb-cc29-6efa739f506f" marked for migration
2018-04-19T08:12:22Z: Alloc "50c1dbb4-1ab5-23aa-3200-7549ecba9568" marked for migration
2018-04-19T08:12:22Z: Alloc "1b2fc50c-610a-bff7-2033-966ed07c8ddd" status running -> complete
2018-04-19T08:12:22Z: Alloc "37ab49ab-42d9-0693-2c59-1db9090b4139" status running -> complete
2018-04-19T08:12:22Z: Alloc "f277e614-fa8d-dfeb-cc29-6efa739f506f" draining
2018-04-19T08:12:22Z: Alloc "50c1dbb4-1ab5-23aa-3200-7549ecba9568" draining
2018-04-19T08:12:22Z: Alloc "4ffff6cf-40e7-3013-436a-1c10654e4507" draining
2018-04-19T08:12:22Z: Alloc "8fe49e0c-5f2e-67a2-acb2-5b86b921a4ec" marked for migration
2018-04-19T08:12:22Z: Node "4f990645-c723-9f9f-5379-786f88f4f989" drain complete
2018-04-19T08:12:22Z: Alloc "8fe49e0c-5f2e-67a2-acb2-5b86b921a4ec" draining
2018-04-19T08:12:22Z: Alloc "8fe49e0c-5f2e-67a2-acb2-5b86b921a4ec" status running -> complete
2018-04-19T08:12:22Z: Alloc "e585f0fc-ee85-937c-1393-1e071e733d21" status running -> complete
2018-04-19T08:12:22Z: Alloc "50c1dbb4-1ab5-23aa-3200-7549ecba9568" status running -> complete
2018-04-19T08:12:23Z: Alloc "4ffff6cf-40e7-3013-436a-1c10654e4507" status running -> complete
2018-04-19T08:12:23Z: Alloc "f277e614-fa8d-dfeb-cc29-6efa739f506f" status running -> complete
2018-04-19T08:12:23Z: All allocations on node "4f990645-c723-9f9f-5379-786f88f4f989" have stopped.

Reproduction steps

Enable node-drain on a worker node and observe the output.
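The reproduction can be sketched as a small script (the node ID below is a placeholder copied from the report; it assumes a reachable Nomad 0.8.x cluster and the deprecated-but-working `node-drain` subcommand form used above):

```shell
#!/usr/bin/env bash
# Sketch only: reproduce the ordering issue on a worker node.
# NODE_ID is a placeholder; substitute a real worker node ID.
NODE_ID="f913db8d-8af6-c5e1-2957-ed54cece201c"

if command -v nomad >/dev/null 2>&1; then
  # Enable draining and watch the monitor output. The reported issue is
  # that the "drain complete" line can appear before the final
  # 'status running -> complete' lines for individual allocations.
  nomad node-drain -enable "$NODE_ID"
else
  echo "nomad CLI not found; this is a sketch only"
fi
```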

Labels: theme/core type/enhancement

All 4 comments

Hi @jrasell,

This is by design but hopefully we can figure out how to make this behavior less surprising!

I attempted to explain this output in the docs, but it's buried at the very end of the Decommissioning Nodes guide:

The final node drain command did not exit until 6 seconds after the drain complete message because the command line tool blocks until all allocations on the node have stopped. This allows operators to script shutting down a node once a drain command exits and know all services have already exited.
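That blocking behavior is what makes a naive decommission script safe to write. A minimal sketch (the node ID default and the commented-out terminate step are hypothetical placeholders, not part of the thread):

```shell
#!/usr/bin/env bash
set -euo pipefail
# Hypothetical decommission script. Because the drain command blocks
# until all allocations on the node have stopped, anything after it
# only runs once every service on the node has exited.
NODE_ID="${1:-4f990645-c723-9f9f-5379-786f88f4f989}"  # placeholder

if command -v nomad >/dev/null 2>&1; then
  nomad node-drain -enable "$NODE_ID"
  # Safe to power the instance off here: the command has exited, so no
  # allocations remain running on the node, e.g. (placeholder):
  # aws ec2 terminate-instances --instance-ids "$INSTANCE_ID"
else
  echo "nomad CLI not found; sketch only"
fi
```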

The node drain command docs don't really mention this behavior either. The most relevant line is:

By default the node drain command blocks until a node is done draining and all allocations have terminated.

The reason for this behavior is that a drain is complete once all allocations are _marked for migration,_ not when the allocations have completed. The CLI waits until all allocations have completed in an attempt to ease scripting.

We considered having the drain itself not be considered complete until all allocations have completed, but once allocations are marked for migration on a node there are no more actions for the drain to take. It's functionally complete and monitoring for allocation completion would just tie up server resources for longer.

Resolution

I think the best resolution is expanding the node drain docs to clarify this behavior, but I'm curious if you think that's sufficient.

Thanks for the thorough report!

Thanks for the detailed response @schmichael; I now understand properly what is happening. Additional docs would be great if they clarify that Node "4f990645-c723-9f9f-5379-786f88f4f989" drain complete means all allocations have been marked for migration.

I am curious, though, whether a change of log message would be better, so operators don't have to read detailed guides about migration logic to understand that drain complete doesn't mean the node has drained? Just a thought; happy with whatever you deem best here.

I am curious though if a change of log message would be better so operators don't have to read detailed guides about migration logic to understand drain complete doesn't mean the node has been drained?

Hm, this seems like the best idea. I chose drain complete because from the server's perspective there's no more work to be done so the Node.DrainSpec field is cleared. Maybe this would make a better log message?

Node "4f990645-c723-9f9f-5379-786f88f4f989" has marked all allocations for migration.

I'll see what we can come up with, and let me know if you have any ideas! I'd much rather make the log message less confusing than have to write a whole paragraph explaining exactly what drain complete means.

@schmichael that log message would be a good improvement in my opinion.
