Awx: Jobs running on container instance groups are stuck in pending state

Created on 26 Oct 2020 · 4 Comments · Source: ansible/awx

ISSUE TYPE
  • Bug Report
SUMMARY

Jobs running on container instance groups never start; they stay stuck in the pending state. I think this is related to the recent PR #8333.

This appears to be the line that the scheduler is erroring on:
https://github.com/ansible/awx/blob/d550487bc8cbdfd333e5881b9fc6192b63252d3a/awx/main/scheduler/task_manager.py#L287

The PR added a required instances argument to the function called on that line, but the call site still passes only the task:
https://github.com/ansible/awx/blob/d550487bc8cbdfd333e5881b9fc6192b63252d3a/awx/main/models/ha.py#L265
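
A minimal sketch of the mismatch, assuming a simplified stand-in class rather than the real AWX model: `FakeInstanceGroup`, the dictionary shapes, and the capacity logic are hypothetical and only illustrate why calling the method without `instances` raises the TypeError shown in the log below.

```python
class FakeInstanceGroup:
    """Hypothetical stand-in for the instance group model; not the AWX code."""

    def fit_task_to_most_remaining_capacity_instance(self, task, instances):
        # Assumed logic: among instances that can fit the task,
        # pick the one with the most remaining capacity.
        candidates = [i for i in instances if i["remaining"] >= task["impact"]]
        return max(candidates, key=lambda i: i["remaining"], default=None)


def call_old_style(group, task):
    """Call the method the way the failing call site does (no `instances`)."""
    try:
        return group.fit_task_to_most_remaining_capacity_instance(task)
    except TypeError as exc:
        # Return the message so it can be compared with the scheduler log.
        return str(exc)


group = FakeInstanceGroup()
task = {"impact": 2}
instances = [{"name": "a", "remaining": 1}, {"name": "b", "remaining": 5}]

# Old-style call: fails because `instances` is required.
error_message = call_old_style(group, task)
print(error_message)

# Call with the extra argument: succeeds and returns the chosen instance.
match = group.fit_task_to_most_remaining_capacity_instance(task, instances)
print(match)
```

The first call reproduces the "missing 1 required positional argument" message; the second shows the call shape the new signature expects. How the scheduler should actually obtain the instances is not covered here.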

ENVIRONMENT
  • AWX version: 15.0.1
  • AWX install method: docker on linux
  • Ansible version: 2.9.14
  • Operating System: Red Hat Enterprise Linux 8.1
  • Web Browser: Chrome 86
STEPS TO REPRODUCE
  • Create a container instance group
  • Assign job to container instance group
  • Launch job
EXPECTED RESULTS

Container is created and job is completed

ACTUAL RESULTS

Job remains in pending state indefinitely

ADDITIONAL INFORMATION

The following keeps getting output to the awx_task log every few seconds:

2020-10-26T10:20:05.725390912Z 2020-10-26 10:20:05,722 DEBUG    awx.main.dispatch task c105fdd9-7077-4594-99da-1703ebc58db7 starting awx.main.scheduler.tasks.run_task_manager(*[])
2020-10-26T10:20:05.725420112Z 2020-10-26 10:20:05,724 DEBUG    awx.main.scheduler Running Tower task manager.
2020-10-26T10:20:05.732995992Z 2020-10-26 10:20:05,729 DEBUG    awx.main.scheduler Starting Scheduler
2020-10-26T10:20:05.831512837Z 2020-10-26 10:20:05,830 ERROR    awx.main.dispatch Worker failed to run task awx.main.scheduler.tasks.run_task_manager(*[], **{}
2020-10-26T10:20:05.831538437Z Traceback (most recent call last):
2020-10-26T10:20:05.831543637Z   File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/dispatch/worker/task.py", line 86, in perform_work
2020-10-26T10:20:05.831547937Z     result = self.run_callable(body)
2020-10-26T10:20:05.831551637Z   File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/dispatch/worker/task.py", line 62, in run_callable
2020-10-26T10:20:05.831555537Z     return _call(*args, **kwargs)
2020-10-26T10:20:05.831559637Z   File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/scheduler/tasks.py", line 16, in run_task_manager
2020-10-26T10:20:05.831563537Z     TaskManager().schedule()
2020-10-26T10:20:05.831587738Z   File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/scheduler/task_manager.py", line 644, in schedule
2020-10-26T10:20:05.831601338Z     self._schedule()
2020-10-26T10:20:05.831604738Z   File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/scheduler/task_manager.py", line 632, in _schedule
2020-10-26T10:20:05.831608038Z     self.process_tasks(all_sorted_tasks)
2020-10-26T10:20:05.831611038Z   File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/scheduler/task_manager.py", line 598, in process_tasks
2020-10-26T10:20:05.831614238Z     self.process_pending_tasks(pending_tasks)
2020-10-26T10:20:05.831635138Z   File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/scheduler/task_manager.py", line 511, in process_pending_tasks
2020-10-26T10:20:05.831639038Z     self.start_task(task, rampart_group, task.get_jobs_fail_chain(), None)
2020-10-26T10:20:05.831642338Z   File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/scheduler/task_manager.py", line 287, in start_task
2020-10-26T10:20:05.831645738Z     match = group.fit_task_to_most_remaining_capacity_instance(task)
2020-10-26T10:20:05.831649238Z TypeError: fit_task_to_most_remaining_capacity_instance() missing 1 required positional argument: 'instances'

Labels: api, high, bug

All 4 comments

Hey @paulstaffs,

This was an oversight on our part from some recent optimizations to the task manager. It should be addressed in the next release (this PR contains the fix):

https://github.com/ansible/awx/pull/8457
https://github.com/ansible/awx/pull/8457/files#diff-a37220424979b0075fa6e25bf8c309d671f30641e64830c27cd046272c73a703R288

Going to close this, as our downstream tests are now passing and the fix will go out in the next release.

I'm seeing this issue on the latest pull from the devel branch. I'm still getting the error below:

2020-12-08 17:05:28,678 DEBUG awx.main.dispatch task f800c829-7790-4052-b434-08b535ca89c4 starting awx.main.scheduler.tasks.run_task_manager(*[])
2020-12-08 17:05:28,680 DEBUG awx.main.scheduler Running Tower task manager.
2020-12-08 17:05:28,685 DEBUG awx.main.scheduler Starting Scheduler
2020-12-08 17:05:28,774 ERROR awx.main.dispatch Worker failed to run task awx.main.scheduler.tasks.run_task_manager(*[], **{}
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/dispatch/worker/task.py", line 86, in perform_work
    result = self.run_callable(body)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/dispatch/worker/task.py", line 62, in run_callable
    return _call(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/scheduler/tasks.py", line 16, in run_task_manager
    TaskManager().schedule()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/scheduler/task_manager.py", line 644, in schedule
    self._schedule()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/scheduler/task_manager.py", line 632, in _schedule
    self.process_tasks(all_sorted_tasks)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/scheduler/task_manager.py", line 598, in process_tasks
    self.process_pending_tasks(pending_tasks)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/scheduler/task_manager.py", line 511, in process_pending_tasks
    self.start_task(task, rampart_group, task.get_jobs_fail_chain(), None)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/scheduler/task_manager.py", line 287, in start_task
    match = group.fit_task_to_most_remaining_capacity_instance(task)
TypeError: fit_task_to_most_remaining_capacity_instance() missing 1 required positional argument: 'instances'

Please advise on what I need to do to get this resolved. Thank you!

@emanuelferguson you might want to make sure you're actually deploying the latest AWX release. The line number you've referenced (287) doesn't match this source line in devel:

https://github.com/ansible/awx/blame/devel/awx/main/scheduler/task_manager.py#L288
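
One way to check what is actually deployed is to print the source line the traceback points at. This is a rough sketch using only the Python standard library; the path and line number are copied from the traceback above, the helper name `source_line` is made up for illustration, and it assumes you run it inside the task container where that file exists.

```python
import linecache


def source_line(path, lineno):
    """Return the stripped text of line `lineno` in `path`, or '' if unavailable."""
    # linecache.getline returns an empty string instead of raising on errors.
    return linecache.getline(path, lineno).strip()


# Path and line number taken from the traceback; adjust if your install differs.
DEPLOYED = (
    "/var/lib/awx/venv/awx/lib/python3.6/site-packages/"
    "awx/main/scheduler/task_manager.py"
)

print(repr(source_line(DEPLOYED, 287)))
```

If the printed line is still the call that passes only `task`, the running package is probably older than the source you pulled, and the image or virtualenv likely needs rebuilding.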
