0.5.4
5 Windows boxes = 3 servers + 2 clients
After a Nomad client restart, some of the jobs scheduled on that client stayed in the pending state for more than two hours, without any attempt to start them. This is a severe issue for us and we would like to have it fixed ASAP.
I don't have exact steps. We know that this sometimes occurs after a client restart, and we have definitely faced this problem more than once.
The full Nomad client logs are here:
https://gist.github.com/capone212/94f47912f1f9e700195d9b45e846b7e3
One of the problem jobs is "ngprunner.Node001".
While the problem was reproducing, I made the following observations:
>nomad status ngprunner.Node001
ID = ngprunner.Node001
Name = ngprunner.Node001
Type = service
Priority = 50
Datacenters = dc1
Status = running
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
ngp 0 1 0 0 0 1
postgres 0 0 1 0 0 1
Allocations
ID Eval ID Node ID Task Group Desired Status Created At
2748409b 209b0121 34ce800b ngp run pending 04/14/17 11:25:09 EAT
94ec5139 209b0121 34ce800b postgres run running 04/14/17 11:25:09 EAT
3d170428 0224aabc 34ce800b postgres stop lost 04/14/17 10:39:24 EAT
b0741602 0224aabc 34ce800b ngp stop lost 04/14/17 10:39:24 EAT
>nomad alloc-status -verbose 2748409b
ID = 2748409b-ad2c-4008-9dde-72a38a01480e
Eval ID = 209b0121-81ec-ebf0-6616-a8a403136673
Name = ngprunner.Node001.ngp[0]
Node ID = 34ce800b-5d57-1b17-de70-c494fadbdeb8
Job ID = ngprunner.Node001
Client Status = pending
Client Description = <none>
Desired Status = run
Desired Description = <none>
Created At = 04/14/17 11:25:09 EAT
Evaluated Nodes = 4
Filtered Nodes = 3
Exhausted Nodes = 0
Allocation Time = 0s
Failures = 0
Placement Metrics
* Constraint "${node.unique.name} = QA-T21" filtered 3 nodes
* Score "34ce800b-5d57-1b17-de70-c494fadbdeb8.binpack" = 4.575736
* Score "34ce800b-5d57-1b17-de70-c494fadbdeb8.job-anti-affinity" = -10.000000
You can see that allocation 2748409b is in the pending state, but there was no attempt to start or restart that allocation on the host machine.
This is an extract from nomad.log about the job:
2017/04/14 10:39:24.687255 [DEBUG] client: starting task runners for alloc 'b0741602-7127-f66e-5aaf-d4d1df1397be'
2017/04/14 10:39:24.687255 [DEBUG] client: starting task context for 'ngprunner' (alloc 'b0741602-7127-f66e-5aaf-d4d1df1397be')
2017/04/14 10:39:24 [DEBUG] plugin: starting plugin: C:\Program Files\RaftLauncher\current\nomad.exe []string{"C:\\Program Files\\RaftLauncher\\current\\nomad.exe", "executor", "{\"LogFile\":\"C:\\\\ProgramData\\\\AxxonSoft\\\\AxxonNext\\\\RaFT\\\\cluster\\\\nomad\\\\data\\\\alloc\\\\b0741602-7127-f66e-5aaf-d4d1df1397be\\\\ngprunner\\\\executor.out\",\"LogLevel\":\"DEBUG\"}"}
2017/04/14 11:22:52.439567 [DEBUG] client: alloc "b0741602-7127-f66e-5aaf-d4d1df1397be" in terminal status, waiting for destroy
2017/04/14 11:25:00.300414 [ERR] client: failed to open handle to task 'ngprunner' for alloc 'b0741602-7127-f66e-5aaf-d4d1df1397be': error connecting to plugin: error creating rpc client for executor plugin: OpenProcess: The parameter is incorrect.
2017/04/14 11:25:00.300414 [DEBUG] client: alloc "b0741602-7127-f66e-5aaf-d4d1df1397be" in terminal status, waiting for destroy
2017/04/14 11:25:09.286548 [DEBUG] client: added alloc "2748409b-ad2c-4008-9dde-72a38a01480e" to blocked queue for previous allocation "b0741602-7127-f66e-5aaf-d4d1df1397be"
Allocation 2748409b was in the "blocked queue", and after more than two hours the allocation suddenly started:
2017/04/14 13:40:09.807652 [DEBUG] client: starting task runners for alloc '2748409b-ad2c-4008-9dde-72a38a01480e'
2017/04/14 13:40:09.807652 [DEBUG] client: starting task context for 'ngprunner' (alloc '2748409b-ad2c-4008-9dde-72a38a01480e')
This is how the allocation looks after the job suddenly started:
ID = 2748409b-ad2c-4008-9dde-72a38a01480e 13:58
Eval ID = 209b0121-81ec-ebf0-6616-a8a403136673
Name = ngprunner.Node001.ngp[0]
Node ID = 34ce800b-5d57-1b17-de70-c494fadbdeb8
Job ID = ngprunner.Node001
Client Status = running
Client Description = <none>
Desired Status = run
Desired Description = <none>
Created At = 04/14/17 11:25:09 EAT
Evaluated Nodes = 4
Filtered Nodes = 3
Exhausted Nodes = 0
Allocation Time = 0s
Failures = 0
Task "ngprunner" is "running"
Task Resources
CPU Memory Disk IOPS Addresses
10000 MHz 2.9 GiB 0 B 0 NGP_BASE: 10.0.15.1:1000
Recent Events:
Time Type Description
04/14/17 13:40:11 EAT Started Task started by client
04/14/17 13:40:09 EAT Received Task received by client
Placement Metrics
* Constraint "${node.unique.name} = QA-T21" filtered 3 nodes
* Score "34ce800b-5d57-1b17-de70-c494fadbdeb8.binpack" = 4.575736
* Score "34ce800b-5d57-1b17-de70-c494fadbdeb8.job-anti-affinity" = -10.000000
Can you give the output of: curl http://127.0.0.1:4646/v1/node/<full node-id>/allocations
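For reference, here is a minimal Go sketch that fetches the same data as that curl call through the official github.com/hashicorp/nomad/api package (the address and node ID are taken from the output above; adjust as needed):
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	cfg := api.DefaultConfig()
	cfg.Address = "http://127.0.0.1:4646"
	client, err := api.NewClient(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Same data as GET /v1/node/<full node-id>/allocations.
	allocs, _, err := client.Nodes().Allocations("34ce800b-5d57-1b17-de70-c494fadbdeb8", nil)
	if err != nil {
		log.Fatal(err)
	}
	for _, a := range allocs {
		fmt.Printf("%s  desired=%s  client=%s  job=%s\n", a.ID, a.DesiredStatus, a.ClientStatus, a.JobID)
	}
}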
@capone212 That should fix it! Master should be bug free if you want to build and roll out master!
@dadgar thanks a lot for your quick fix!!!
Hi @dadgar, I have just noticed that unfortunately the fix does not help. The good news, though, is that I can now reproduce it easily in my environment.
With 3 nodes in server+client mode I do the following:
1) Create a job pinned to one specific server: job.Constrain(apiNomad.NewConstraint("${node.unique.name}", "=", agentId)) (see the sketch after these steps)
2) Run the job
3) Power off that specific server
4) Wait 1 minute, then update the job
5) Bring the server back up
At that point, in roughly 30-50% of cases, we end up with a task stuck in the pending state.
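For completeness, here is a minimal sketch of step 1 using the standard github.com/hashicorp/nomad/api package. The task group, task name, driver, and its config are placeholders for illustration; only the constraint mirrors the real setup:
package main

import (
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	agentID := "SRV1" // ${node.unique.name} of the node the job is pinned to

	job := api.NewServiceJob("pgrunner."+agentID, "pgrunner."+agentID, "global", 60)
	job.Datacenters = []string{"dc1"}
	// Pin the job to a single node by its unique name (step 1 above).
	job.Constrain(api.NewConstraint("${node.unique.name}", "=", agentID))

	// Hypothetical task and driver, for illustration only.
	task := api.NewTask("pgrunner", "raw_exec").SetConfig("command", "postgres.exe")
	job.AddTaskGroup(api.NewTaskGroup("postgres", 1).AddTask(task))

	// Register the job (step 2); re-registering it later is the "update" in step 4.
	if _, _, err := client.Jobs().Register(job, nil); err != nil {
		log.Fatal(err)
	}
}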
The following is debug data from my last test with your fix applied.
>nomad status pgrunner.cxRPaR97LE.SRV1
ID = pgrunner.cxRPaR97LE.SRV1
Name = pgrunner.cxRPaR97LE.SRV1
Type = service
Priority = 60
Datacenters = dc1
Status = running
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
postgres 0 1 1 0 4 0
Allocations
ID Eval ID Node ID Task Group Desired Status Created At
0eac3064 75bbb534 f303c409 postgres run pending 06/22/17 19:13:06 BOT
19fb4725 ea67e6e3 f303c409 postgres stop complete 06/22/17 19:09:08 BOT
75443349 4bf104fd f303c409 postgres stop complete 06/22/17 19:08:02 BOT
05569c70 08f92ee3 f303c409 postgres stop running 06/22/17 19:05:07 BOT
d28605a7 a8e408f1 f303c409 postgres stop complete 06/22/17 19:02:19 BOT
a0284ddf 2992e4a9 f303c409 postgres stop complete 06/22/17 18:58:46 BOT
From logs:
2017/06/22 19:13:06.348691 [DEBUG] client: added alloc "0eac3064-a10f-1eb0-f7e2-1c540dd29b97" to blocked queue for previous allocation "19fb4725-34fa-5e15-91b3-c5ce6899e860"
From the allocations:
1
ID "0eac3064-a10f-1eb0-f7e2-1c540dd29b97"
EvalID "75bbb534-6f17-9edf-e314-9aed533ffdbe"
Name "pgrunner.cxRPaR97LE.SRV1.postgres[0]"
NodeID "f303c409-ed57-4f62-2b2b-f88a45f9540f"
JobID "pgrunner.cxRPaR97LE.SRV1"
TaskGroup "postgres"
DesiredStatus "run"
DesiredDescription ""
ClientStatus "pending"
ClientDescription ""
TaskStates
CreateIndex 2041
ModifyIndex 2041
CreateTime 1498173186329835000
2
ID "19fb4725-34fa-5e15-91b3-c5ce6899e860"
EvalID "ea67e6e3-6280-0358-0368-519e03ee1740"
Name "pgrunner.cxRPaR97LE.SRV1.postgres[0]"
NodeID "f303c409-ed57-4f62-2b2b-f88a45f9540f"
JobID "pgrunner.cxRPaR97LE.SRV1"
TaskGroup "postgres"
DesiredStatus "stop"
DesiredDescription "alloc is lost since its node is down"
ClientStatus "complete"
ClientDescription ""
TaskStates
pgrunner
State "dead"
Failed false
StartedAt "2017-06-22T23:09:09.5177181Z"
FinishedAt "2017-06-22T23:12:57.7315015Z"
Events
0
Type "Received"
Time 1498172948476707300
FailsTask false
RestartReason ""
SetupError ""
DriverError ""
ExitCode 0
Signal 0
Message ""
KillTimeout 0
KillError ""
KillReason ""
StartDelay 0
DownloadError ""
ValidationError ""
DiskLimit 0
FailedSibling ""
VaultError ""
TaskSignalReason ""
TaskSignal ""
DriverMessage ""
1
Type "Task Setup"
Time 1498172948476707300
FailsTask false
RestartReason ""
SetupError ""
DriverError ""
ExitCode 0
Signal 0
Message "Building Task Directory"
KillTimeout 0
KillError ""
KillReason ""
StartDelay 0
DownloadError ""
ValidationError ""
DiskLimit 0
FailedSibling ""
VaultError ""
TaskSignalReason ""
TaskSignal ""
DriverMessage ""
2
Type "Started"
Time 1498172949517718000
FailsTask false
RestartReason ""
SetupError ""
DriverError ""
ExitCode 0
Signal 0
Message ""
KillTimeout 0
KillError ""
KillReason ""
StartDelay 0
DownloadError ""
ValidationError ""
DiskLimit 0
FailedSibling ""
VaultError ""
TaskSignalReason ""
TaskSignal ""
DriverMessage ""
3
Type "Started"
Time 1498173175443420200
FailsTask false
RestartReason ""
SetupError ""
DriverError ""
ExitCode 0
Signal 0
Message ""
KillTimeout 0
KillError ""
KillReason ""
StartDelay 0
DownloadError ""
ValidationError ""
DiskLimit 0
FailedSibling ""
VaultError ""
TaskSignalReason ""
TaskSignal ""
DriverMessage ""
4
Type "Killing"
Time 1498173175443420200
FailsTask false
RestartReason ""
SetupError ""
DriverError ""
ExitCode 0
Signal 0
Message ""
KillTimeout 5000000000
KillError ""
KillReason ""
StartDelay 0
DownloadError ""
ValidationError ""
DiskLimit 0
FailedSibling ""
VaultError ""
TaskSignalReason ""
TaskSignal ""
DriverMessage ""
5
Type "Killed"
Time 1498173177579159800
FailsTask false
RestartReason ""
SetupError ""
DriverError ""
ExitCode 0
Signal 0
Message ""
KillTimeout 0
KillError ""
KillReason ""
StartDelay 0
DownloadError ""
ValidationError ""
DiskLimit 0
FailedSibling ""
VaultError ""
TaskSignalReason ""
TaskSignal ""
DriverMessage ""
CreateIndex 1989
ModifyIndex 2034
CreateTime 1498172948382785500
Please find full log and full allocations attached:
@dadgar, if you need any help please let me know. As I said, I can reproduce it and am eager to test patches.
And yes, after several hours, the job finally started!
ID = pgrunner.cxRPaR97LE.SRV1
Name = pgrunner.cxRPaR97LE.SRV1
Type = service
Priority = 60
Datacenters = dc1
Status = running
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
postgres 0 0 2 0 4 0
Allocations
ID Eval ID Node ID Task Group Desired Status Created At
0eac3064 75bbb534 f303c409 postgres run running 06/22/17 19:13:06 BOT
The essence of what is going on:
[ERR] client: cleanup allocation faf96d53-2900-acd2-61b7-dbe621244981
==> Nomad agent configuration:
2017/06/22 19:12:54.403378 [ERR] client: failed to find task dir metadata for alloc "faf96d53-2900-acd2-61b7-dbe621244981" task "pgrunner"
2017/06/22 19:12:54.403378 [ERR] client: failed to restore state for alloc faf96d53-2900-acd2-61b7-dbe621244981: failed to find task dir metadata for alloc "faf96d53-2900-acd2-61b7-dbe621244981" task "pgrunner"
2017/06/22 19:12:56.662167 [DEBUG] client: alloc "faf96d53-2900-acd2-61b7-dbe621244981" in terminal status, waiting for destroy
2017/06/22 19:12:56.662167 [DEBUG] client: added alloc "c75d7468-0055-73ca-1411-bfb932f9de7b" to blocked queue for previous allocation "faf96d53-2900-acd2-61b7-dbe621244981"
2017/06/22 20:40:34.379971 [INFO] client: marking allocation faf96d53-2900-acd2-61b7-dbe621244981 for GC
2017/06/22 20:40:34.379971 [INFO] client: garbage collecting allocation "faf96d53-2900-acd2-61b7-dbe621244981"
2017/06/22 20:40:34.380948 [DEBUG] client: terminating runner for alloc 'faf96d53-2900-acd2-61b7-dbe621244981'
2017/06/22 20:40:34.380948 [DEBUG] client: garbage collected "faf96d53-2900-acd2-61b7-dbe621244981"
2017/06/22 20:40:34.380948 [DEBUG] client: starting task runners for alloc '0eac3064-a10f-1eb0-f7e2-1c540dd29b97'
2017/06/22 20:40:34.380948 [DEBUG] client: starting task context for 'pgrunner' (alloc '0eac3064-a10f-1eb0-f7e2-1c540dd29b97')
The allocation started only after the previous allocation was garbage collected.
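To make the suspected mechanism concrete, here is a deliberately simplified illustrative model in Go (it is not the actual Nomad client code; all names and IDs are made up). It only mirrors what the logs suggest: the replacement alloc is parked in a blocked queue keyed by the previous alloc ID, and the only path that released it in this failure mode was the destroy triggered by garbage collection:
package main

import "fmt"

type client struct {
	// replacement allocID -> previous allocID it is waiting on
	blocked map[string]string
}

// addAlloc parks the replacement alloc if its previous alloc is still present.
func (c *client) addAlloc(newID, prevID string, prevStillPresent bool) {
	if prevStillPresent {
		c.blocked[newID] = prevID
		fmt.Printf("added alloc %q to blocked queue for previous allocation %q\n", newID, prevID)
		return
	}
	c.start(newID)
}

// destroyPrevious releases any alloc blocked on prevID. In the failure we
// observed, this appears to be reached only from the GC path, hours later.
func (c *client) destroyPrevious(prevID string) {
	for newID, waitingOn := range c.blocked {
		if waitingOn == prevID {
			delete(c.blocked, newID)
			c.start(newID)
		}
	}
}

func (c *client) start(id string) {
	fmt.Printf("starting task runners for alloc %q\n", id)
}

func main() {
	c := &client{blocked: map[string]string{}}
	// Restore of the previous alloc failed, but it is still present
	// (terminal, "waiting for destroy"), so the new alloc gets parked.
	c.addAlloc("new-alloc", "prev-alloc", true)
	// ... much later, garbage collection finally destroys the previous alloc ...
	c.destroyPrevious("prev-alloc")
}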
Hmm, thanks for the great repro steps! We'll get this fixed by 0.6.0 and will update the issue when we have a patch for you to test!
Sorry for the delay; I've been trying to reproduce your bug to make sure it's fixed before we release 0.6. Unfortunately, I've been unable to reproduce it. We've made a lot of changes to our state file handling (related to the "failed to restore state for alloc" log messages), so I'm hoping your issue has been fixed!
Any chance you'd be able to test using these binaries?
linux_amd64.zip
windows_amd64.zip
If you still see the "failed to restore state for alloc" log message I would love to get a copy of your <state_dir>/state.db file. Please do not upload it to GitHub as it contains full copies of your jobs and allocations which may contain private data. Feel free to email me directly: [email protected]
Hi @schmichael
I am going to test binaries you provided.
Hi @schmichael
Unfortunately, I have the same issue with the provided binaries.
allocation.txt
nomad.log.txt
There is no "failed to restore state for alloc" message in the log file.
If you need any additional files, let me know.
Thanks for testing. I'm having a hard time reproducing this. Could you provide more details on how to reproduce this?
It appears you're running (at least) a 3 node cluster with the node in question running as a client+server. While that's not a recommended configuration it shouldn't cause bugs.
I'm also not sure what exactly happened in that last log file you posted as there's no long delay like in your initial report.
I found and fixed a small race condition in somewhat related code, but I think it could cause the original behavior you described. I'm planning to do a larger refactor of the alloc blocking logic, but it's unlikely to make it into 0.6.
Here's a build from the #2780 branch if you want to test:
windows_amd64.zip
@schmichael I have tried the last binary you provided, but I was able to reproduce the same problem with it.
I have invested my weekend in understanding what is going on, and I think I found the bug. Looking forward to your feedback on the proposed solution.