0.5.4
5 Windows boxes = 3 servers + 2 clients
After a Nomad client restart, some of the jobs scheduled on that client stayed in the pending state for more than two hours, without any attempt to start them. This is a severe issue for us and we would like to have it fixed ASAP.
I don't have exact steps. We know that this sometimes occurs after a client restart, and we have definitely faced this problem more than once.
The full Nomad client logs are here:
https://gist.github.com/capone212/94f47912f1f9e700195d9b45e846b7e3
One of the problem jobs is "ngprunner.Node001".
While the problem was reproducing, I made the following observations:
>nomad status ngprunner.Node001
ID = ngprunner.Node001
Name = ngprunner.Node001
Type = service
Priority = 50
Datacenters = dc1
Status = running
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
ngp 0 1 0 0 0 1
postgres 0 0 1 0 0 1
Allocations
ID Eval ID Node ID Task Group Desired Status Created At
2748409b 209b0121 34ce800b ngp run pending 04/14/17 11:25:09 EAT
94ec5139 209b0121 34ce800b postgres run running 04/14/17 11:25:09 EAT
3d170428 0224aabc 34ce800b postgres stop lost 04/14/17 10:39:24 EAT
b0741602 0224aabc 34ce800b ngp stop lost 04/14/17 10:39:24 EAT
>nomad alloc-status -verbose 2748409b
ID = 2748409b-ad2c-4008-9dde-72a38a01480e
Eval ID = 209b0121-81ec-ebf0-6616-a8a403136673
Name = ngprunner.Node001.ngp[0]
Node ID = 34ce800b-5d57-1b17-de70-c494fadbdeb8
Job ID = ngprunner.Node001
Client Status = pending
Client Description = <none>
Desired Status = run
Desired Description = <none>
Created At = 04/14/17 11:25:09 EAT
Evaluated Nodes = 4
Filtered Nodes = 3
Exhausted Nodes = 0
Allocation Time = 0s
Failures = 0
Placement Metrics
* Constraint "${node.unique.name} = QA-T21" filtered 3 nodes
* Score "34ce800b-5d57-1b17-de70-c494fadbdeb8.binpack" = 4.575736
* Score "34ce800b-5d57-1b17-de70-c494fadbdeb8.job-anti-affinity" = -10.000000
You can see that allocation 2748409b is in the pending state, but there was no attempt to start or restart that allocation on the host machine.
This is an extract from nomad.log about the job:
2017/04/14 10:39:24.687255 [DEBUG] client: starting task runners for alloc 'b0741602-7127-f66e-5aaf-d4d1df1397be'
2017/04/14 10:39:24.687255 [DEBUG] client: starting task context for 'ngprunner' (alloc 'b0741602-7127-f66e-5aaf-d4d1df1397be')
2017/04/14 10:39:24 [DEBUG] plugin: starting plugin: C:\Program Files\RaftLauncher\current\nomad.exe []string{"C:\\Program Files\\RaftLauncher\\current\\nomad.exe", "executor", "{\"LogFile\":\"C:\\\\ProgramData\\\\AxxonSoft\\\\AxxonNext\\\\RaFT\\\\cluster\\\\nomad\\\\data\\\\alloc\\\\b0741602-7127-f66e-5aaf-d4d1df1397be\\\\ngprunner\\\\executor.out\",\"LogLevel\":\"DEBUG\"}"}
2017/04/14 11:22:52.439567 [DEBUG] client: alloc "b0741602-7127-f66e-5aaf-d4d1df1397be" in terminal status, waiting for destroy
2017/04/14 11:25:00.300414 [ERR] client: failed to open handle to task 'ngprunner' for alloc 'b0741602-7127-f66e-5aaf-d4d1df1397be': error connecting to plugin: error creating rpc client for executor plugin: OpenProcess: The parameter is incorrect.
2017/04/14 11:25:00.300414 [DEBUG] client: alloc "b0741602-7127-f66e-5aaf-d4d1df1397be" in terminal status, waiting for destroy
2017/04/14 11:25:09.286548 [DEBUG] client: added alloc "2748409b-ad2c-4008-9dde-72a38a01480e" to blocked queue for previous allocation "b0741602-7127-f66e-5aaf-d4d1df1397be"
Allocation 2748409b was in the "blocked queue", and after more than two hours the allocation suddenly started:
2017/04/14 13:40:09.807652 [DEBUG] client: starting task runners for alloc '2748409b-ad2c-4008-9dde-72a38a01480e'
2017/04/14 13:40:09.807652 [DEBUG] client: starting task context for 'ngprunner' (alloc '2748409b-ad2c-4008-9dde-72a38a01480e')
This is how the allocation looks after the job suddenly started:
ID = 2748409b-ad2c-4008-9dde-72a38a01480e 13:58
Eval ID = 209b0121-81ec-ebf0-6616-a8a403136673
Name = ngprunner.Node001.ngp[0]
Node ID = 34ce800b-5d57-1b17-de70-c494fadbdeb8
Job ID = ngprunner.Node001
Client Status = running
Client Description = <none>
Desired Status = run
Desired Description = <none>
Created At = 04/14/17 11:25:09 EAT
Evaluated Nodes = 4
Filtered Nodes = 3
Exhausted Nodes = 0
Allocation Time = 0s
Failures = 0
Task "ngprunner" is "running"
Task Resources
CPU Memory Disk IOPS Addresses
10000 MHz 2.9 GiB 0 B 0 NGP_BASE: 10.0.15.1:1000
Recent Events:
Time Type Description
04/14/17 13:40:11 EAT Started Task started by client
04/14/17 13:40:09 EAT Received Task received by client
Placement Metrics
* Constraint "${node.unique.name} = QA-T21" filtered 3 nodes
* Score "34ce800b-5d57-1b17-de70-c494fadbdeb8.binpack" = 4.575736
* Score "34ce800b-5d57-1b17-de70-c494fadbdeb8.job-anti-affinity" = -10.000000
Can you give the output of: curl http://127.0.0.1:4646/v1/node/<full node-id>/allocations
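For reference, here is a minimal Go sketch that fetches the same data as that curl call through the official github.com/hashicorp/nomad/api package (the address and node ID are taken from the output above; adjust as needed):
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	cfg := api.DefaultConfig()
	cfg.Address = "http://127.0.0.1:4646"
	client, err := api.NewClient(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Same data as GET /v1/node/<full node-id>/allocations.
	allocs, _, err := client.Nodes().Allocations("34ce800b-5d57-1b17-de70-c494fadbdeb8", nil)
	if err != nil {
		log.Fatal(err)
	}
	for _, a := range allocs {
		fmt.Printf("%s  desired=%s  client=%s  job=%s\n", a.ID, a.DesiredStatus, a.ClientStatus, a.JobID)
	}
}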
@capone212 That should fix it! Master should be bug free if you want to build and roll out master!
@dadgar thanks a lot for your quick fix!!!
Hi @dadgar, I have just noticed that unfortunately the fix does not help. The good news, though, is that I can now reproduce it easily in my environment.
With 3 nodes in server+client mode I do the following:
1) Create a job pinned to one specific server: job.Constrain(apiNomad.NewConstraint("${node.unique.name}", "=", agentId)) (see the sketch after these steps)
2) Run the job
3) Power off that specific server
4) Wait 1 minute, then update the job
5) Bring the server back up
At that point, in roughly 30-50% of cases, we end up with a task stuck in the pending state.
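For completeness, here is a minimal sketch of step 1 using the standard github.com/hashicorp/nomad/api package. The task group, task name, driver, and its config are placeholders for illustration; only the constraint mirrors the real setup:
package main

import (
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	agentID := "SRV1" // ${node.unique.name} of the node the job is pinned to

	job := api.NewServiceJob("pgrunner."+agentID, "pgrunner."+agentID, "global", 60)
	job.Datacenters = []string{"dc1"}
	// Pin the job to a single node by its unique name (step 1 above).
	job.Constrain(api.NewConstraint("${node.unique.name}", "=", agentID))

	// Hypothetical task and driver, for illustration only.
	task := api.NewTask("pgrunner", "raw_exec").SetConfig("command", "postgres.exe")
	job.AddTaskGroup(api.NewTaskGroup("postgres", 1).AddTask(task))

	// Register the job (step 2); re-registering it later is the "update" in step 4.
	if _, _, err := client.Jobs().Register(job, nil); err != nil {
		log.Fatal(err)
	}
}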
The following is debug data from my last test with your fix applied.
>nomad status pgrunner.cxRPaR97LE.SRV1
ID = pgrunner.cxRPaR97LE.SRV1
Name = pgrunner.cxRPaR97LE.SRV1
Type = service
Priority = 60
Datacenters = dc1
Status = running
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
postgres 0 1 1 0 4 0
Allocations
ID Eval ID Node ID Task Group Desired Status Created At
0eac3064 75bbb534 f303c409 postgres run pending 06/22/17 19:13:06 BOT
19fb4725 ea67e6e3 f303c409 postgres stop complete 06/22/17 19:09:08 BOT
75443349 4bf104fd f303c409 postgres stop complete 06/22/17 19:08:02 BOT
05569c70 08f92ee3 f303c409 postgres stop running 06/22/17 19:05:07 BOT
d28605a7 a8e408f1 f303c409 postgres stop complete 06/22/17 19:02:19 BOT
a0284ddf 2992e4a9 f303c409 postgres stop complete 06/22/17 18:58:46 BOT
From logs:
2017/06/22 19:13:06.348691 [DEBUG] client: added alloc "0eac3064-a10f-1eb0-f7e2-1c540dd29b97" to blocked queue for previous allocation "19fb4725-34fa-5e15-91b3-c5ce6899e860"
From the allocations:
1
ID "0eac3064-a10f-1eb0-f7e2-1c540dd29b97"
EvalID "75bbb534-6f17-9edf-e314-9aed533ffdbe"
Name "pgrunner.cxRPaR97LE.SRV1.postgres[0]"
NodeID "f303c409-ed57-4f62-2b2b-f88a45f9540f"
JobID "pgrunner.cxRPaR97LE.SRV1"
TaskGroup "postgres"
DesiredStatus "run"
DesiredDescription ""
ClientStatus "pending"
ClientDescription ""
TaskStates
CreateIndex 2041
ModifyIndex 2041
CreateTime 1498173186329835000
2
ID "19fb4725-34fa-5e15-91b3-c5ce6899e860"
EvalID "ea67e6e3-6280-0358-0368-519e03ee1740"
Name "pgrunner.cxRPaR97LE.SRV1.postgres[0]"
NodeID "f303c409-ed57-4f62-2b2b-f88a45f9540f"
JobID "pgrunner.cxRPaR97LE.SRV1"
TaskGroup "postgres"
DesiredStatus "stop"
DesiredDescription "alloc is lost since its node is down"
ClientStatus "complete"
ClientDescription ""
TaskStates
pgrunner
State "dead"
Failed false
StartedAt "2017-06-22T23:09:09.5177181Z"
FinishedAt "2017-06-22T23:12:57.7315015Z"
Events
0
Type "Received"
Time 1498172948476707300
FailsTask false
RestartReason ""
SetupError ""
DriverError ""
ExitCode 0
Signal 0
Message ""
KillTimeout 0
KillError ""
KillReason ""
StartDelay 0
DownloadError ""
ValidationError ""
DiskLimit 0
FailedSibling ""
VaultError ""
TaskSignalReason ""
TaskSignal ""
DriverMessage ""
1
Type "Task Setup"
Time 1498172948476707300
FailsTask false
RestartReason ""
SetupError ""
DriverError ""
ExitCode 0
Signal 0
Message "Building Task Directory"
KillTimeout 0
KillError ""
KillReason ""
StartDelay 0
DownloadError ""
ValidationError ""
DiskLimit 0
FailedSibling ""
VaultError ""
TaskSignalReason ""
TaskSignal ""
DriverMessage ""
2
Type "Started"
Time 1498172949517718000
FailsTask false
RestartReason ""
SetupError ""
DriverError ""
ExitCode 0
Signal 0
Message ""
KillTimeout 0
KillError ""
KillReason ""
StartDelay 0
DownloadError ""
ValidationError ""
DiskLimit 0
FailedSibling ""
VaultError ""
TaskSignalReason ""
TaskSignal ""
DriverMessage ""
3
Type "Started"
Time 1498173175443420200
FailsTask false
RestartReason ""
SetupError ""
DriverError ""
ExitCode 0
Signal 0
Message ""
KillTimeout 0
KillError ""
KillReason ""
StartDelay 0
DownloadError ""
ValidationError ""
DiskLimit 0
FailedSibling ""
VaultError ""
TaskSignalReason ""
TaskSignal ""
DriverMessage ""
4
Type "Killing"
Time 1498173175443420200
FailsTask false
RestartReason ""
SetupError ""
DriverError ""
ExitCode 0
Signal 0
Message ""
KillTimeout 5000000000
KillError ""
KillReason ""
StartDelay 0
DownloadError ""
ValidationError ""
DiskLimit 0
FailedSibling ""
VaultError ""
TaskSignalReason ""
TaskSignal ""
DriverMessage ""
5
Type "Killed"
Time 1498173177579159800
FailsTask false
RestartReason ""
SetupError ""
DriverError ""
ExitCode 0
Signal 0
Message ""
KillTimeout 0
KillError ""
KillReason ""
StartDelay 0
DownloadError ""
ValidationError ""
DiskLimit 0
FailedSibling ""
VaultError ""
TaskSignalReason ""
TaskSignal ""
DriverMessage ""
CreateIndex 1989
ModifyIndex 2034
CreateTime 1498172948382785500
Please find full log and full allocations attached:
@dadgar, if you need any help please let me know. As I said, I can reproduce it and am eager to test patches.
And yes, after several hours, the job finally started!
ID = pgrunner.cxRPaR97LE.SRV1
Name = pgrunner.cxRPaR97LE.SRV1
Type = service
Priority = 60
Datacenters = dc1
Status = running
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
postgres 0 0 2 0 4 0
Allocations
ID Eval ID Node ID Task Group Desired Status Created At
0eac3064 75bbb534 f303c409 postgres run running 06/22/17 19:13:06 BOT
The essence of what is going on:
[ERR] client: cleanup allocation faf96d53-2900-acd2-61b7-dbe621244981
==> Nomad agent configuration:
2017/06/22 19:12:54.403378 [ERR] client: failed to find task dir metadata for alloc "faf96d53-2900-acd2-61b7-dbe621244981" task "pgrunner"
2017/06/22 19:12:54.403378 [ERR] client: failed to restore state for alloc faf96d53-2900-acd2-61b7-dbe621244981: failed to find task dir metadata for alloc "faf96d53-2900-acd2-61b7-dbe621244981" task "pgrunner"
2017/06/22 19:12:56.662167 [DEBUG] client: alloc "faf96d53-2900-acd2-61b7-dbe621244981" in terminal status, waiting for destroy
2017/06/22 19:12:56.662167 [DEBUG] client: added alloc "c75d7468-0055-73ca-1411-bfb932f9de7b" to blocked queue for previous allocation "faf96d53-2900-acd2-61b7-dbe621244981"
2017/06/22 20:40:34.379971 [INFO] client: marking allocation faf96d53-2900-acd2-61b7-dbe621244981 for GC
2017/06/22 20:40:34.379971 [INFO] client: garbage collecting allocation "faf96d53-2900-acd2-61b7-dbe621244981"
2017/06/22 20:40:34.380948 [DEBUG] client: terminating runner for alloc 'faf96d53-2900-acd2-61b7-dbe621244981'
2017/06/22 20:40:34.380948 [DEBUG] client: garbage collected "faf96d53-2900-acd2-61b7-dbe621244981"
2017/06/22 20:40:34.380948 [DEBUG] client: starting task runners for alloc '0eac3064-a10f-1eb0-f7e2-1c540dd29b97'
2017/06/22 20:40:34.380948 [DEBUG] client: starting task context for 'pgrunner' (alloc '0eac3064-a10f-1eb0-f7e2-1c540dd29b97')
The allocation started only after the previous allocation was garbage collected.
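To make the suspected mechanism concrete, here is a deliberately simplified illustrative model in Go (it is not the actual Nomad client code; all names and IDs are made up). It only mirrors what the logs suggest: the replacement alloc is parked in a blocked queue keyed by the previous alloc ID, and the only path that released it in this failure mode was the destroy triggered by garbage collection:
package main

import "fmt"

type client struct {
	// replacement allocID -> previous allocID it is waiting on
	blocked map[string]string
}

// addAlloc parks the replacement alloc if its previous alloc is still present.
func (c *client) addAlloc(newID, prevID string, prevStillPresent bool) {
	if prevStillPresent {
		c.blocked[newID] = prevID
		fmt.Printf("added alloc %q to blocked queue for previous allocation %q\n", newID, prevID)
		return
	}
	c.start(newID)
}

// destroyPrevious releases any alloc blocked on prevID. In the failure we
// observed, this appears to be reached only from the GC path, hours later.
func (c *client) destroyPrevious(prevID string) {
	for newID, waitingOn := range c.blocked {
		if waitingOn == prevID {
			delete(c.blocked, newID)
			c.start(newID)
		}
	}
}

func (c *client) start(id string) {
	fmt.Printf("starting task runners for alloc %q\n", id)
}

func main() {
	c := &client{blocked: map[string]string{}}
	// Restore of the previous alloc failed, but it is still present
	// (terminal, "waiting for destroy"), so the new alloc gets parked.
	c.addAlloc("new-alloc", "prev-alloc", true)
	// ... much later, garbage collection finally destroys the previous alloc ...
	c.destroyPrevious("prev-alloc")
}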
Hmm, thanks for the great repro steps! We'll get this fixed by 0.6.0 and will update the issue when we have a patch for you to test!
Sorry for the delay; I've been trying to reproduce your bug to make sure it's fixed before we release 0.6. Unfortunately, I've been unable to reproduce it. We've made a lot of changes to our state file handling (related to the "failed to restore state for alloc" log messages), so I'm hoping your issue has been fixed!
Any chance you'd be able to test using these binaries?
linux_amd64.zip
windows_amd64.zip
If you still see the "failed to restore state for alloc" log message I would love to get a copy of your <state_dir>/state.db file. Please do not upload it to GitHub as it contains full copies of your jobs and allocations which may contain private data. Feel free to email me directly: [email protected]
Hi @schmichael
I am going to test binaries you provided.
Hi @schmichael
Unfortunately, I have the same issue with the provided binaries.
allocation.txt
nomad.log.txt
There is no "failed to restore state for alloc" message in the log file.
If you need any additional files, let me know.
Thanks for testing. I'm having a hard time reproducing this. Could you provide more details on how to reproduce this?
It appears you're running (at least) a 3 node cluster with the node in question running as a client+server. While that's not a recommended configuration it shouldn't cause bugs.
I'm also not sure what exactly happened in that last log file you posted as there's no long delay like in your initial report.
I found and fixed a small race condition in somewhat related code, but I think it could cause the original behavior you described. I'm planning to do a larger refactor of the alloc blocking logic, but it's unlikely to make it into 0.6.
Here's a build from the #2780 branch if you want to test:
windows_amd64.zip
@schmichael I have tried the last binary you provided, but I was able to reproduce the same problem with it.
I have invested my weekend in understanding what is going on, and I think I found the bug. Looking forward to your feedback on the proposed solution.