AWX: Job stays stuck in pending with no other jobs running

Created on 27 Sep 2017 · 22 comments · Source: ansible/awx

ISSUE TYPE

  • Bug Report
COMPONENT NAME

  • UI
SUMMARY


Any job I try to run just stays in pending. I have no other jobs running.

ENVIRONMENT
STEPS TO REPRODUCE

Can't find a way to reproduce it.

EXPECTED RESULTS

Job should run

ACTUAL RESULTS

Job stays in pending.

ADDITIONAL INFORMATION

output from awx_task log:

[2017-09-27 21:53:56,680: DEBUG/Worker-42] Running Tower task manager.
2017-09-27 21:53:56,690 DEBUG    awx.main.scheduler Starting Scheduler
2017-09-27 21:53:56,690 DEBUG    awx.main.scheduler Starting Scheduler
[2017-09-27 21:53:56,690: DEBUG/Worker-42] Starting Scheduler
2017-09-27 21:53:56,732 DEBUG    awx.main.scheduler project_update 123 (pending) is blocked from running
2017-09-27 21:53:56,732 DEBUG    awx.main.scheduler project_update 123 (pending) is blocked from running
[2017-09-27 21:53:56,732: DEBUG/Worker-42] project_update 123 (pending) is blocked from running
[2017-09-27 21:53:56,748: INFO/MainProcess] Task awx.main.scheduler.tasks.run_task_manager[a1bc3e85-bf92-47dc-bce7-4dc2c94a0e2d] succeeded in 0.0872339019988s: None
[2017-09-27 21:54:16,590: INFO/MainProcess] Received task: awx.main.scheduler.tasks.run_task_manager[5f2404ff-bc40-4d0e-8732-28715e1e3dd6] expires:[2017-09-27 21:54:36.587988+00:00]
[2017-09-27 21:54:16,591: DEBUG/MainProcess] TaskPool: Apply <function _fast_trace_task at 0x5c62668> (args:(u'awx.main.scheduler.tasks.run_task_manager', u'5f2404ff-bc40-4d0e-8732-28715e1e3dd6', [], {}, {u'utc': True, u'is_eager': False, u'chord': None, u'group': None, u'args': [], u'retries': 0, u'delivery_info': {u'priority': None, u'redelivered': False, u'routing_key': u'tower', u'exchange': u'tower'}, u'expires': u'2017-09-27T21:54:36.587988+00:00', u'hostname': 'celery@localhost', u'task': u'awx.main.scheduler.tasks.run_task_manager', u'callbacks': None, u'correlation_id': u'5f2404ff-bc40-4d0e-8732-28715e1e3dd6', u'errbacks': None, u'timelimit': [None, None], u'taskset': None, u'kwargs': {}, u'eta': None, u'reply_to': u'64ec93c0-61c0-3f6d-8363-645f9f0796f5', u'id': u'5f2404ff-bc40-4d0e-8732-28715e1e3dd6', u'headers': {}}) kwargs:{})
[2017-09-27 21:54:16,594: DEBUG/MainProcess] Task accepted: awx.main.scheduler.tasks.run_task_manager[5f2404ff-bc40-4d0e-8732-28715e1e3dd6] pid:198
[2017-09-27 21:54:16,586: INFO/MainProcess] Scheduler: Sending due task task_manager (awx.main.scheduler.tasks.run_task_manager)
[2017-09-27 21:54:16,588: DEBUG/MainProcess] awx.main.scheduler.tasks.run_task_manager sent. id->5f2404ff-bc40-4d0e-8732-28715e1e3dd6
[2017-09-27 21:54:16,594: DEBUG/MainProcess] beat: Waking up in 9.95 seconds.
2017-09-27 21:54:16,617 DEBUG    awx.main.scheduler Running Tower task manager.
2017-09-27 21:54:16,617 DEBUG    awx.main.scheduler Running Tower task manager.
[2017-09-27 21:54:16,617: DEBUG/Worker-42] Running Tower task manager.
2017-09-27 21:54:16,624 DEBUG    awx.main.scheduler Starting Scheduler
2017-09-27 21:54:16,624 DEBUG    awx.main.scheduler Starting Scheduler
[2017-09-27 21:54:16,624: DEBUG/Worker-42] Starting Scheduler
2017-09-27 21:54:16,659 DEBUG    awx.main.scheduler project_update 123 (pending) is blocked from running
2017-09-27 21:54:16,659 DEBUG    awx.main.scheduler project_update 123 (pending) is blocked from running
[2017-09-27 21:54:16,659: DEBUG/Worker-42] project_update 123 (pending) is blocked from running
[2017-09-27 21:54:16,669: INFO/MainProcess] Task awx.main.scheduler.tasks.run_task_manager[5f2404ff-bc40-4d0e-8732-28715e1e3dd6] succeeded in 0.0774938369996s: None
Labels: api, medium, needs_info, bug

Most helpful comment

@matburt I faced the same problem several times. I did not find a way to solve it. Please investigate :)

All 22 comments

Looks like something is definitely stuck... do you see anything when hitting the API:

towerhost/api/v2/unified_jobs/?status=running ?
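For reference, a minimal sketch of that check in Python, assuming the install is reachable at https://towerhost with admin credentials (host, username and password are placeholders, not values from this issue):

# Hypothetical check of the running-jobs endpoint mentioned above.
import requests

resp = requests.get(
    "https://towerhost/api/v2/unified_jobs/",
    params={"status": "running"},
    auth=("admin", "password"),
    verify=False,  # skip TLS verification if the install uses a self-signed certificate
)
for job in resp.json().get("results", []):
    print(job["id"], job["type"], job["status"])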

@dp19 My best guess is that the system has not identified any groups/nodes that can run the job. The reason this might be happening is that the Instance known by AWX is different from the host/server that AWX is currently running on.

awx-manage shell_plus
instances = Instance.objects.all()   # "is" is a reserved word in Python, so use another name
instances                            # there should be one and only one Instance
i = instances[0]
i.hostname                           # should match the value below
settings.CLUSTER_HOST_ID

If no instance is found, then the instance failed to register. Also, i.hostname and settings.CLUSTER_HOST_ID should be the same value.
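A self-contained sketch of the same check, to run inside awx-manage shell_plus (shell_plus imports the models automatically; the imports are shown here for clarity):

# Sketch of the registration check described above.
from django.conf import settings
from awx.main.models import Instance

instances = Instance.objects.all()
print(instances.count(), "registered instance(s)")  # should be exactly 1
if instances and instances[0].hostname != settings.CLUSTER_HOST_ID:
    print("Mismatch:", instances[0].hostname, "!=", settings.CLUSTER_HOST_ID)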

Are you using the openshift or minishift install method?

@chrismeyersfsu I am not; I'm using the local Docker installation. I backed up my local Postgres data and reinstalled AWX from scratch today. I'll spin back up with the old data later today to get the information @matburt requested.

Thanks!

@matburt the API says one job (id 113) is currently running, though I don't see it at all in the UI:

{
    "count": 1,
    "next": null,
    "previous": null,
    "results": [
        {
            "id": 113,
            "type": "project_update",
            "url": "/api/v2/project_updates/113/",
            "related": {
                "credential": "/api/v2/credentials/2/",
                "unified_job_template": "/api/v2/projects/6/",
                "stdout": "/api/v2/project_updates/113/stdout/",
                "cancel": "/api/v2/project_updates/113/cancel/",
                "notifications": "/api/v2/project_updates/113/notifications/",
                "scm_inventory_updates": "/api/v2/project_updates/113/scm_inventory_updates/",
                "project": "/api/v2/projects/6/"
            },

@matburt I issued a curl call to cancel the job and that seems to have solved my issue, though I'm not sure why it happened in the first place. I'll leave this environment up if anyone needs me to look deeper into what happened.
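For anyone else hitting this, the cancel call presumably targets the "cancel" link shown in the related links above; a rough sketch in Python (host and credentials are placeholders):

# Hypothetical cancel request against the cancel endpoint from the API response above.
import requests

requests.post(
    "https://towerhost/api/v2/project_updates/113/cancel/",
    auth=("admin", "password"),
    verify=False,  # skip TLS verification if the install uses a self-signed certificate
)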

Was there any more detail on the job? What was that project update trying to do?

It was performing an SCM update with my playbooks. From a quick glance I couldn't see anything out of place; it was just in a running state.

I reckon it could have gotten stuck on something... very odd indeed. Any reason why that would hang perpetually?

Nothing in particular; it's a small repo with around 20 playbooks in it. I was working on updating the Zabbix install to work with AWX. I ran an update, then tried running the template for the Zabbix install, and that's the one that hung. I did cancel it and it showed canceled in the UI, but after that I was unable to launch a new job.

Have you experienced this any more? Are you able to reproduce this consistently?

@matburt I haven't seen this issue again and I'm unable to reproduce it. I'll close this out

Same issue here, let me know if I should open a separate issue.

root@ansible:/opt/awx# tower-cli job list --status running
=== ============ ======================== ======= =============
id  job_template created                  status  elapsed
=== ============ ======================== ======= =============
842 9            2017-10-04T05:00:25.856Z running 1179365.16963
=== ============ ======================== ======= =============

root@ansible:/opt/awx# tower-cli job cancel 842
OK. (changed: true)

However, the job is still there:

root@ansible:/opt/awx# tower-cli job list --status running
=== ============ ======================== ======= =============
id  job_template created                  status  elapsed
=== ============ ======================== ======= =============
842 9            2017-10-04T05:00:25.856Z running 1179416.91637
=== ============ ======================== ======= =============

Looking at the logs, the issue seems to have been caused by a server asking for a password (the key is probably not installed on it), as the last line of the job output is [email protected]'s password.

Here is the error while trying to stop the job:

[2017-10-17 20:48:22,639: INFO/MainProcess] Received task: awx.main.scheduler.tasks.run_task_manager[974febd8-3eab-4c22-be32-c0e99e18fd83] expires:[2017-10-17 20:48:42.631291+00:00]
[2017-10-17 20:48:22,641: DEBUG/MainProcess] TaskPool: Apply <function _fast_trace_task at 0x58fdc80> (args:(u'awx.main.scheduler.tasks.run_task_manager', u'974febd8-3eab-4c22-be32-c0e99e18fd83', [], {}, {u'utc': True, u'is_eager': False, u'chord': None, u'group': None, u'args': [], u'retries': 0, u'delivery_info': {u'priority': None, u'redelivered': False, u'routing_key': u'tower', u'exchange': u'tower'}, u'expires': u'2017-10-17T20:48:42.631291+00:00', u'hostname': 'celery@localhost', u'task': u'awx.main.scheduler.tasks.run_task_manager', u'callbacks': None, u'correlation_id': u'974febd8-3eab-4c22-be32-c0e99e18fd83', u'errbacks': None, u'timelimit': [None, None], u'taskset': None, u'kwargs': {}, u'eta': None, u'reply_to': u'de2131e4-dd2e-3564-bf5b-b6bd78e6478d', u'id': u'974febd8-3eab-4c22-be32-c0e99e18fd83', u'headers': {}}) kwargs:{})
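When tower-cli job cancel does not actually clear the job, one last-resort workaround (not an official procedure, and not confirmed in this thread) is to flip the job's status directly from awx-manage shell_plus; a sketch:

# Hypothetical cleanup for a job the scheduler will not release; use with care.
from awx.main.models import UnifiedJob

job = UnifiedJob.objects.get(pk=842)  # the stuck job id from the listing above
job.status = 'canceled'
job.save()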

Seems more related to : https://github.com/ansible/awx/issues/378

I also have a similar issue which is causing me problems.

A job is stuck on pending and no matter what I do (container restart, job cancellation, awx update etc) it will not remove itself.

I believe the issue is with the inventory the job template is using. All job templates assigned to the same inventory are being queued. If I try to force an inventory sync (aws) it just states 'Waiting for results...' in the output.

To get around the issue I just had to create a new inventory and manually update all templates to use said inventory (zzz).

Is there anything I can try to resolve this? I'd like to understand what the issue is so if it happens in future I won't have to go through the above steps again.

Is the inventory update getting stuck in that situation?

Yes - For me the associated Inventory AND project both get stuck.

I have a job that is permanently stuck running an EC2 inventory update, with some 6000 other inventory updates pending because the scheduled jobs are now stuck behind it.

@matburt I faced the same problem several times. I did not find a way to solve it. Please investigate :)

Were you able to solve this issue?

We just got bit by something that looks like this.

SSH to the server and run gitlab-runner run to solve the problem in the foreground, or use gitlab-runner run & to run it in the background.

I had a similar issue and solved it by stopping the process related to the running playbook. You can identify the running process with the command below.

ps -eo pid,lstart,cmd
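A rough sketch of that cleanup in Python, assuming the stuck job left an ansible-playbook process behind (adjust the match string to whatever your stuck job was running):

# Find and terminate a lingering playbook process, as described in the comment above.
import os
import signal
import subprocess

out = subprocess.check_output(["ps", "-eo", "pid,lstart,cmd"], universal_newlines=True)
for line in out.splitlines():
    if "ansible-playbook" in line:
        pid = int(line.split()[0])
        print("terminating", pid, "->", line.strip())
        os.kill(pid, signal.SIGTERM)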

