Jobs remain in pending; in the log I see a lot of "awx.main.tasks Not running scheduler, another task holds lock"
There are some closed issues regarding this, but I didn't find any solution.
Randomly, AWX gets stuck and all jobs remain in "pending".
In the logs I see:
task_1 | 2020-01-09 16:08:59,344 DEBUG awx.main.dispatch task a1c42d7f-17b2-4590-b013-b82f76105f0c starting awx.main.scheduler.tasks.run_task_manager(*[])
task_1 | 2020-01-09 16:08:59,347 DEBUG awx.main.scheduler Running Tower task manager.
task_1 | 2020-01-09 16:08:59,359 DEBUG awx.main.scheduler Not running scheduler, another task holds lock
task_1 | 2020-01-09 16:08:59,360 DEBUG awx.main.dispatch task a1c42d7f-17b2-4590-b013-b82f76105f0c is finished
After a long time (this time 16 minutes), it suddenly starts to execute the task:
task_1 | 2020-01-09 16:24:19,814 DEBUG awx.main.dispatch task b832fea9-c31f-4cfb-8ce3-fda16a1ac5cb starting awx.main.scheduler.tasks.run_task_manager(*[])
task_1 | 2020-01-09 16:24:19,816 DEBUG awx.main.scheduler Running Tower task manager.
task_1 | 2020-01-09 16:24:19,827 DEBUG awx.main.scheduler Not running scheduler, another task holds lock
task_1 | 2020-01-09 16:24:19,828 DEBUG awx.main.dispatch task b832fea9-c31f-4cfb-8ce3-fda16a1ac5cb is finished
task_1 | 2020-01-09 16:24:38,979 DEBUG awx.main.dispatch publish awx.main.tasks.cluster_node_heartbeat(aacec6cc-b438-4cb7-b3f8-2a9ba0b3f399, queue=awx)
task_1 | [2020-01-09 16:24:38,979: DEBUG/Process-1] publish awx.main.tasks.cluster_node_heartbeat(aacec6cc-b438-4cb7-b3f8-2a9ba0b3f399, queue=awx)
rabbitmq_1 | 2020-01-09 16:24:38.989 [info] <0.3775.0> accepting AMQP connection <0.3775.0> (172.19.0.5:48346 -> 172.19.0.3:5672)
rabbitmq_1 | 2020-01-09 16:24:38.991 [info] <0.3775.0> connection <0.3775.0> (172.19.0.5:48346 -> 172.19.0.3:5672): user 'guest' authenticated and granted access to vhost 'awx'
rabbitmq_1 | 2020-01-09 16:24:38.996 [info] <0.3775.0> closing AMQP connection <0.3775.0> (172.19.0.5:48346 -> 172.19.0.3:5672, vhost: 'awx', user: 'guest')
task_1 | 2020-01-09 16:24:39,003 DEBUG awx.main.dispatch delivered aacec6cc-b438-4cb7-b3f8-2a9ba0b3f399 to worker[178] qsize 0
task_1 | 2020-01-09 16:24:39,005 DEBUG awx.main.dispatch task aacec6cc-b438-4cb7-b3f8-2a9ba0b3f399 starting awx.main.tasks.cluster_node_heartbeat(*[])
task_1 | 2020-01-09 16:24:39,008 DEBUG awx.main.tasks Cluster node heartbeat task.
task_1 | 2020-01-09 16:24:39,020 DEBUG awx.main.dispatch task aacec6cc-b438-4cb7-b3f8-2a9ba0b3f399 is finished
task_1 | 2020-01-09 16:24:39,090 DEBUG awx.main.dispatch publish awx.main.tasks.awx_k8s_reaper(e33c3d96-163b-487f-8101-00dd80f101a7, queue=awx)
task_1 | [2020-01-09 16:24:39,090: DEBUG/Process-1] publish awx.main.tasks.awx_k8s_reaper(e33c3d96-163b-487f-8101-00dd80f101a7, queue=awx)
rabbitmq_1 | 2020-01-09 16:24:39.100 [info] <0.3788.0> accepting AMQP connection <0.3788.0> (172.19.0.5:48350 -> 172.19.0.3:5672)
rabbitmq_1 | 2020-01-09 16:24:39.102 [info] <0.3788.0> connection <0.3788.0> (172.19.0.5:48350 -> 172.19.0.3:5672): user 'guest' authenticated and granted access to vhost 'awx'
rabbitmq_1 | 2020-01-09 16:24:39.104 [info] <0.3788.0> closing AMQP connection <0.3788.0> (172.19.0.5:48350 -> 172.19.0.3:5672, vhost: 'awx', user: 'guest')
task_1 | 2020-01-09 16:24:39,105 DEBUG awx.main.dispatch delivered e33c3d96-163b-487f-8101-00dd80f101a7 to worker[178] qsize 0
task_1 | 2020-01-09 16:24:39,108 DEBUG awx.main.dispatch task e33c3d96-163b-487f-8101-00dd80f101a7 starting awx.main.tasks.awx_k8s_reaper(*[])
task_1 | 2020-01-09 16:24:39,117 DEBUG awx.main.dispatch task e33c3d96-163b-487f-8101-00dd80f101a7 is finished
task_1 | 2020-01-09 16:24:39,344 DEBUG awx.main.dispatch publish awx.main.tasks.awx_periodic_scheduler(97554fd6-f2c2-4b5f-a25d-bf09743b9105, queue=awx_private_queue)
task_1 | [2020-01-09 16:24:39,344: DEBUG/Process-1] publish awx.main.tasks.awx_periodic_scheduler(97554fd6-f2c2-4b5f-a25d-bf09743b9105, queue=awx_private_queue)
rabbitmq_1 | 2020-01-09 16:24:39.354 [info] <0.3801.0> accepting AMQP connection <0.3801.0> (172.19.0.5:48354 -> 172.19.0.3:5672)
rabbitmq_1 | 2020-01-09 16:24:39.356 [info] <0.3801.0> connection <0.3801.0> (172.19.0.5:48354 -> 172.19.0.3:5672): user 'guest' authenticated and granted access to vhost 'awx'
task_1 | 2020-01-09 16:24:39,359 DEBUG awx.main.dispatch delivered 97554fd6-f2c2-4b5f-a25d-bf09743b9105 to worker[178] qsize 0
rabbitmq_1 | 2020-01-09 16:24:39.359 [info] <0.3801.0> closing AMQP connection <0.3801.0> (172.19.0.5:48354 -> 172.19.0.3:5672, vhost: 'awx', user: 'guest')
task_1 | 2020-01-09 16:24:39,361 DEBUG awx.main.dispatch task 97554fd6-f2c2-4b5f-a25d-bf09743b9105 starting awx.main.tasks.awx_periodic_scheduler(*[])
task_1 | 2020-01-09 16:24:39,377 DEBUG awx.main.tasks Starting periodic scheduler
task_1 | 2020-01-09 16:24:39,380 DEBUG awx.main.tasks Last scheduler run was: 2020-01-09 16:05:39.310989+00:00
task_1 | 2020-01-09 16:24:39,390 DEBUG awx.main.dispatch task 97554fd6-f2c2-4b5f-a25d-bf09743b9105 is finished
Is there a way to know which other task is blocking the scheduler?
What info do you need?
What sort of deployment do you have - what is the topology?
If this is in kubernetes, what is your pod status?
I'm using docker-compose with an external Postgres DB.
It's an AWX 3.x installation (where we had the same issue) that I upgraded to 9.x.
version: '2'
services:
  nginx:
    image: nginx:stable
    restart: unless-stopped
    volumes:
      - "~/docker_projects/awx/nginx_ssl.conf:/etc/nginx/nginx.conf:ro"
      - "~/docker_projects/awx/cert/:/etc/nginx/certs/"
    ports:
      - "80:80"
      - "443:443"
    depends_on:
      - web
  web:
    image: ansible/awx_web:9.1.0
    container_name: awx_web
    depends_on:
      - rabbitmq
      - memcached
    ports:
      - "8052"
    hostname: awxweb
    user: root
    restart: unless-stopped
    volumes:
      - "~/docker_projects/awx/SECRET_KEY:/etc/tower/SECRET_KEY"
      - "~/docker_projects/awx/environment.sh:/etc/tower/conf.d/environment.sh"
      - "~/docker_projects/awx/credentials.py:/etc/tower/conf.d/credentials.py"
      - "~/docker_projects/awx/nginx.conf:/etc/nginx/nginx.conf:ro"
    environment:
      http_proxy:
      https_proxy:
      no_proxy:
  task:
    image: ansible/awx_task:9.1.0
    container_name: awx_task
    depends_on:
      - rabbitmq
      - memcached
      - web
    hostname: awx
    user: root
    restart: unless-stopped
    volumes:
      - "~/docker_projects/awx/SECRET_KEY:/etc/tower/SECRET_KEY"
      - "~/docker_projects/awx/environment.sh:/etc/tower/conf.d/environment.sh"
      - "~/docker_projects/awx/credentials.py:/etc/tower/conf.d/credentials.py"
    environment:
      http_proxy:
      https_proxy:
      no_proxy:
  rabbitmq:
    image: ansible/awx_rabbitmq:3.7.4
    container_name: awx_rabbitmq
    restart: unless-stopped
    environment:
      RABBITMQ_DEFAULT_VHOST: "awx"
      RABBITMQ_DEFAULT_USER: "guest"
      RABBITMQ_DEFAULT_PASS: "awxpass"
      RABBITMQ_ERLANG_COOKIE: cookiemonster
      http_proxy:
      https_proxy:
      no_proxy:
  memcached:
    image: "memcached:alpine"
    container_name: awx_memcached
    restart: unless-stopped
    environment:
      http_proxy:
      https_proxy:
      no_proxy:
The containers themselves run fine:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
24f48f26e685 nginx:stable "nginx -g 'daemon of…" 10 days ago Up 10 days 0.0.0.0:80->80/tcp, 0.0.0.0:443->443/tcp awx_nginx_1
caece6b0501a ansible/awx_web:9.1.0 "/tini -- /bin/sh -c…" 10 days ago Up 10 days 0.0.0.0:32771->8052/tcp awx_web
eddb2f852093 ansible/awx_task:9.1.0 "/tini -- /bin/sh -c…" 10 days ago Up 10 days 8052/tcp awx_task
8737f57877f1 memcached:alpine "docker-entrypoint.s…" 10 days ago Up 10 days 11211/tcp awx_memcached
7a824047110d ansible/awx_rabbitmq:3.7.4 "docker-entrypoint.s…" 10 days ago Up 10 days 4369/tcp, 5671-5672/tcp, 15671-15672/tcp, 25672/tcp awx_rabbitmq
I am starting to see this same problem. Has anyone else encountered this?
ENVIRONMENT
AWX version: 9.2.0
AWX install method: docker on linux
Operating System: RHEL 7
Similar issue here, except that my jobs never resume. It started occurring with the upgrade to 9.2.0 for me. The job that gets stuck is usually my main workflow that runs every 20 minutes; sometimes it's a sub-workflow for job slicing. Clicking cancel in the AWX web UI doesn't do anything.
After a restart of the awx_task container I can cancel the job in the AWX web UI, and the next ~6 runs (2 hours) run fine. Then it gets stuck again at a random point in the workflow, where new jobs for the next template should be created. Since in my case it never hangs on a job that actually runs a playbook, I don't think it's related to hosts or tasks.
As an example, my last running job (767795) started at log timestamp 19:11:47, then it got stuck.
The main workflow started at 19:00:21 UTC (log time) as job 767756.
https://pastebin.com/3E0WZWWb

AWX version: 9.2.0
AWX install method: Local Docker Compose
Operating System: Debian 10 with kernel 5.4 from backports
@xTrekStorex any chance you could run this in the task container?
awx-manage run_dispatcher --status
...and report back what it prints?
A bit of an extension on that question, if you already have a shell: first run awx-manage shell_plus and then
def check_lock():
    import time
    from awx.main.utils.pglock import advisory_lock
    time.sleep(1)
    with advisory_lock('task_manager_lock', wait=False) as acquired:
        return acquired
You can copy and paste functions, so you can paste that and hit Enter twice.
Next, run
[check_lock() for i in range(20)]
Expect it to take 20 seconds to run. This ensures that we get a full cycle. If any entry returns True, that indicates that this lock is not being held perpetually.
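If every entry returns False, the lock really is being held across the full cycle. In that case, one way to see which database session holds it is to look at pg_locks on the Postgres side, since AWX implements the task manager lock as a Postgres advisory lock. Below is a minimal sketch (not AWX code): it assumes psycopg2 is available and uses placeholder connection settings, and it simply lists all advisory locks together with the session and last query holding them.
# Hypothetical helper: list Postgres advisory locks and their holders.
# Connection parameters are placeholders -- use the values from your credentials.py.
import psycopg2

QUERY = """
SELECT l.pid, l.objid, l.granted, a.state, a.query_start, a.query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.locktype = 'advisory'
ORDER BY a.query_start;
"""

def show_advisory_locks():
    conn = psycopg2.connect(host="postgres", dbname="awx", user="awx", password="awxpass")
    try:
        with conn.cursor() as cur:
            cur.execute(QUERY)
            for pid, key, granted, state, started, query in cur.fetchall():
                print(f"pid={pid} key={key} granted={granted} state={state} "
                      f"since={started} query={(query or '')[:80]!r}")
    finally:
        conn.close()

if __name__ == "__main__":
    show_advisory_locks()
The pid column tells you which Postgres backend holds the lock, and the query column gives a hint about what that session was last doing.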
@xTrekStorex any chance you could run this in the task container?
awx-manage run_dispatcher --status
...and report back what it prints?
2020-02-26 08:43:04,275 WARNING awx.main.dispatch checking dispatcher status for awx
listening on ['awx_private_queue [direct]', 'awx [direct]', 'tower_broadcast_all_awx [fanout]']
awx[pid:144] workers total=4 min=4 max=50
. worker[pid:58556] sent=5 finished=4 qsize=1 rss=121.879MB
- running a07fbd79-5a41-4572-977c-21198e9269a1 run_task_manager(*[])
. worker[pid:63835] sent=372 finished=372 qsize=0 rss=128.586MB [IDLE]
. worker[pid:64365] sent=140 finished=140 qsize=0 rss=129.656MB [IDLE]
. worker[pid:64373] sent=131 finished=131 qsize=0 rss=129.492MB [IDLE]
A bit of an extension on that question, if you already have a shell: first run
awx-manage shell_plus and then
[...]
>>> [check_lock() for i in range(20)]
[False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False]
If you need anything else I'd be happy to try things or post more logs etc.
I just tried killing the worker with the running job. From what I can see in the AWX web UI, this causes the next jobs in the workflow to spawn as pending, but they then get stuck again once "run_task_manager" runs again:
bash-4.4# kill 58556
bash-4.4# awx-manage run_dispatcher --status
2020-02-26 09:13:18,360 WARNING awx.main.dispatch checking dispatcher status for awx
listening on ['awx_private_queue [direct]', 'awx [direct]', 'tower_broadcast_all_awx [fanout]']
awx[pid:144] workers total=3 min=4 max=50
. worker[pid:64851] sent=10 finished=10 qsize=0 rss=130.457MB [IDLE]
. worker[pid:64860] sent=8 finished=8 qsize=0 rss=130.570MB [IDLE]
. worker[pid:64862] sent=10 finished=10 qsize=0 rss=130.633MB [IDLE]
bash-4.4# awx-manage run_dispatcher --status
2020-02-26 09:13:44,632 WARNING awx.main.dispatch checking dispatcher status for awx
listening on ['awx_private_queue [direct]', 'awx [direct]', 'tower_broadcast_all_awx [fanout]']
awx[pid:144] workers total=4 min=4 max=50
. worker[pid:64851] sent=10 finished=10 qsize=0 rss=130.457MB [IDLE]
. worker[pid:64860] sent=8 finished=8 qsize=0 rss=130.570MB [IDLE]
. worker[pid:64862] sent=12 finished=11 qsize=1 rss=137.480MB
- running 44d2fc0d-58d8-4df6-a5a6-ca961221b7da run_task_manager(*[])
. worker[pid:64904] sent=1 finished=1 qsize=0 rss=138.359MB [IDLE]
FYI: I ran awx-manage run_dispatcher --reload before that (which didn't have any effect on the issue, as expected), so that is why the other workers have new PIDs.
Restarting the entire awx_task container "fixes" the issue. The pending jobs start running shortly afterwards, and the workflow completes after "running" for 12 hours - no timeout.
Ignore the failed status; this was due to a temporary issue with a play.

@xTrekStorex,
We're actively tracking what we think is a regression, and I think you're encountering a version of it. The next time you see this, could you:
- Run awx-manage run_dispatcher --status and find the pid of the process stuck on run_task_manager. In your last comment, this pid would be pid:64862.
- Install and run either (or both) of these tools, and share the output of that pid:
yum install strace
strace -p 64862
yum install gdb
gdb -p 64862
@xTrekStorex,
Also, is there any way you could share with us what your workflow looks like?
sosreport-rgogbabteca01-awxhang-2020-02-26-mugqzxz.tar.xz.zip.001.txt
sosreport-rgogbabteca01-awxhang-2020-02-26-mugqzxz.tar.xz.zip.002.txt
sosreport-rgogbabteca01-awxhang-2020-02-26-mugqzxz.tar.xz.zip.003.txt
sosreport-rgogbabteca01-awxhang-2020-02-26-mugqzxz.tar.xz.zip.004.txt
Remove the .txt to extract. I have the same problem and am sending you my sosreport.
awx.dump.zip.001.txt
awx.dump.zip.002.txt
Remove the .txt to extract. I have the same problem and am sending you my database dump.
@blaisemichel,
If you're still encountering a hung install, could you run the commands referenced above and share the output?
https://github.com/ansible/awx/issues/5617#issuecomment-591107161
https://github.com/ansible/awx/issues/5617#issuecomment-591405762
@xTrekStorex,
We're actively tracking what we think is a regression, and I think you're encountering a version of it. The next time you see this, could you:
- Run awx-manage run_dispatcher --status and find the pid of the process stuck on run_task_manager. In your last comment, this pid would be pid:64862.
- Install and run either (or both) of these tools, and share the output of that pid:
[...]
Will do as soon as it happens again. For now, I've encountered a different but maybe related issue: schedules are not running anymore. I see a few scheduler tasks waiting:
2020-02-26 17:50:54,983 WARNING awx.main.dispatch checking dispatcher status for awx
listening on ['awx_private_queue [direct]', 'awx [direct]', 'tower_broadcast_all_awx [fanout]']
awx[pid:150] workers total=18 min=4 max=50
. worker[pid:26066] sent=1 finished=0 qsize=1 rss=120.391MB
- running ee25366a-d3db-440b-94a5-5937d227dc7e gather_analytics(*[])
. worker[pid:106189] sent=1 finished=0 qsize=1 rss=125.023MB
- running d6669a75-d7cf-4b6f-bfb2-9747bc2332b8 awx_isolated_heartbeat(*[])
. worker[pid:285021] sent=1 finished=0 qsize=1 rss=125.820MB
- running 21007516-f287-4af1-b020-723194963400 awx_k8s_reaper(*[])
. worker[pid:298569] sent=2 finished=1 qsize=1 rss=128.562MB
- running ebe401c5-d295-48bd-8ac8-0419048f915d awx_periodic_scheduler(*[])
. worker[pid:341017] sent=1 finished=0 qsize=1 rss=126.910MB
- running 9e666314-40ca-4ab4-aec9-f6be038ccd4b awx_isolated_heartbeat(*[])
. worker[pid:383663] sent=1 finished=0 qsize=1 rss=128.758MB
- running da08c5ba-4ccd-4618-85ef-de6d3e094783 awx_isolated_heartbeat(*[])
. worker[pid:384184] sent=1 finished=0 qsize=1 rss=129.391MB
- running 5c4fb992-0f1d-4c73-adb6-f58e84350d93 awx_isolated_heartbeat(*[])
. worker[pid:384186] sent=1 finished=0 qsize=1 rss=129.391MB
- running 6ad1a202-6762-4f0e-8cec-bb38c08b435c gather_analytics(*[])
. worker[pid:385794] sent=1 finished=0 qsize=1 rss=129.508MB
- running 452cd4dc-489f-4e42-87a6-ae8da70197e8 awx_periodic_scheduler(*[])
. worker[pid:385894] sent=1 finished=0 qsize=1 rss=129.508MB
- running dcbde91e-c16f-4953-a2b3-2768e696f6a6 awx_k8s_reaper(*[])
. worker[pid:386251] sent=1 finished=0 qsize=1 rss=130.410MB
- running 87c70cce-554f-4723-b64c-850a94206afb awx_isolated_heartbeat(*[])
. worker[pid:386258] sent=1 finished=0 qsize=1 rss=130.410MB
- running d6d566f5-9e57-4e5b-ab16-957dfa961904 awx_k8s_reaper(*[])
. worker[pid:386925] sent=1 finished=0 qsize=1 rss=131.250MB
- running 0cacc96e-88be-4716-a7c5-11697dd31706 awx_periodic_scheduler(*[])
. worker[pid:387030] sent=1 finished=0 qsize=1 rss=131.258MB
- running c3db034d-070b-439c-9fd2-4efcda59fcb8 cluster_node_heartbeat(*[])
. worker[pid:390159] sent=1 finished=0 qsize=1 rss=133.598MB
- running e0f5f725-92fd-4357-86f6-33d9154b26dd cluster_node_heartbeat(*[])
. worker[pid:390316] sent=1 finished=0 qsize=1 rss=134.129MB
- running b5fc6193-a3a9-4691-8654-b54d4e9ad206 cluster_node_heartbeat(*[])
. worker[pid:390557] sent=1 finished=0 qsize=1 rss=133.867MB
- running 18e52063-8342-4dc0-a04b-6f2345b3d2cd awx_k8s_reaper(*[])
. worker[pid:390898] sent=1 finished=0 qsize=1 rss=135.133MB
- running 52ee922a-9e06-4b74-851c-0465bd204f88 awx_periodic_scheduler(*[])
Running strace on any of them returns
futex(0x5577f72e6060, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff
However, I can run JTs manually for now.
@xTrekStorex,
Any chance you could run:
gdb -p <pid> on one of those and share the output?
@xTrekStorex,
Also, is there any way you could share with us what your workflow looks like?
Sure can.
Maybe some context first:
I import our vCenter into inventory (via a modified version of the contrib script adding vCenter tags).
My workflow chains a vCenter sync and my JTs, each of which is limited to a host group generated from vCenter tags during import. That way I control basic and continuous Ansible deployments via vCenter instead of AWX. Each JT has an error-handling path calling itself again to handle plays that failed due to temporary issues like dpkg locks, DNS timeouts, etc. If it still fails, it terminates the workflow as failed due to missing error handling.

Is there a specific part of the workflow where you find it hangs? Or does it tend to be random?
@xTrekStorex,
Any chance you could run:
gdb -p <pid> on one of those and share the output?
Is there a specific part of the workflow where you find it hangs? Or does it tend to be random?
Totally random, sometimes nginx, sometimes docker, sometimes common. Can't say I've ever seen it with a pending vCenter sync though.
@xTrekStorex
Sorry, I gave slightly wrong instructions.
Once you get the (gdb) prompt, could you try running these commands (and sharing output):
bt
py-bt
Thank you so much 😄
@xTrekStorex
Sorry, I gave slightly wrong instructions.
Once you get the
(gdb) prompt, could you try running these commands (and sharing output):
bt
py-bt
Thank you so much 😄
Unfortunately py-bt doesn't work due to missing debuginfo for Python.
yum install python36-debug.x86_64 doesn't help, and following the official docs (https://wiki.python.org/moin/DebuggingWithGdb) with
yum install yum-utils
debuginfo-install glibc
yum install gdb python-debuginfo
fail with
No match for argument: python-debuginfo
Error: Unable to find a match: python-debuginfo
yum update doesn't solve this either.
Here is the bt output though: https://pastebin.com/wnkc9SQn
@xTrekStorex
Something like this might get you closer, depending on your platform:
dnf debuginfo-install -y --enablerepo='base-debuginfo' python36
@xTrekStorex
Something like this might get you closer, depending on your platform:
dnf debuginfo-install -y --enablerepo='base-debuginfo' python36
Unfortunately no
bash-4.4# dnf debuginfo-install -y --enablerepo='base-debuginfo' python36
error: rpmdb: damaged header #173 retrieved -- skipping.
enabling epel-debuginfo repository
error: rpmdbNextIterator: skipping h# 173 blob size(55540): BAD, 8 + 16 * il(78) + dl(54284)
Last metadata expiration check: 0:06:50 ago on Wed 26 Feb 2020 07:32:16 PM UTC.
Could not find debuginfo for package: python36-3.6.8-2.module_el8.1.0+245+c39af44f.x86_64
Could not find debuginfo for package: python36-3.6.8-2.module_el8.1.0+245+c39af44f.x86_64
No debuginfo packages available to install
Dependencies resolved.
Nothing to do.
Complete!
My platform is the ansible/awx_task:9.2.0 image from Docker Hub, which apparently is based on CentOS 8. The host itself is running Debian 10, but even if I install python3-dbg and run gdb there on the PIDs matching awx-manage run_dispatcher in the host namespace, I don't get py-bt.
The missing debuginfo packages seem to be a common issue from what I can see on Google. An alternative would be to build Python and the gdb extension from source, but I'd rather not 😄
After tinkering with an install for about an hour, I've reproduced something just like this, and here's what the Python trace looks like for me (mostly leaving this for my benefit):
(gdb) py-bt
Traceback (most recent call first):
<built-in method acquire of _thread.lock object at remote 0x7fd0ac08fd78>
File "<frozen importlib._bootstrap>", line 98, in acquire
File "<frozen importlib._bootstrap>", line 149, in __enter__
File "<frozen importlib._bootstrap>", line 968, in _find_and_load
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/kombu/transport/__init__.py", line 13, in supports_librabbitmq
import librabbitmq # noqa
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/kombu/connection.py", line 251, in _init_params
if transport == 'amqp' and supports_librabbitmq():
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/kombu/connection.py", line 199, in __init__
self._init_params(**params)
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/dispatch/kombu.py", line 17, in __init__
super(Connection, self).__init__(*args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/dispatch/publish.py", line 89, in apply_async
with Connection(settings.BROKER_URL) as conn:
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/dispatch/publish.py", line 67, in delay
return cls.apply_async(args, kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/utils/common.py", line 819, in <lambda>
connection.on_commit(lambda: run_task_manager.delay())
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/db/backends/base/base.py", line 628, in on_commit
func()
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/utils/common.py", line 819, in _schedule_task_manager
connection.on_commit(lambda: run_task_manager.delay())
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/utils/common.py", line 843, in schedule_task_manager
_schedule_task_manager()
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/models/unified_jobs.py", line 1310, in signal_start
schedule_task_manager()
(frame information optimized out)
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/dispatch/worker/task.py", line 62, in run_callable
return _call(*args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/dispatch/worker/task.py", line 86, in perform_work
result = self.run_callable(body)
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/dispatch/worker/base.py", line 147, in work_loop
self.perform_work(body, *args)
File "/usr/lib64/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib64/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib64/python3.6/multiprocessing/popen_fork.py", line 73, in _launch
code = process_obj._bootstrap()
File "/usr/lib64/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/usr/lib64/python3.6/multiprocessing/context.py", line 277, in _Popen
return Popen(process_obj)
File "/usr/lib64/python3.6/multiprocessing/context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/usr/lib64/python3.6/multiprocessing/process.py", line 105, in start
self._popen = self._Popen(self)
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/dispatch/pool.py", line 67, in start
self.process.start()
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/dispatch/pool.py", line 229, in up
worker.start()
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/dispatch/pool.py", line 391, in up
return super(AutoscalePool, self).up()
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/dispatch/pool.py", line 399, in write
self.up()
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/dispatch/worker/base.py", line 103, in process_task
self.pool.write(queue, body)
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/kombu/messaging.py", line 590, in <listcomp>
[callback(body, message) for callback in callbacks]
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/kombu/messaging.py", line 590, in receive
[callback(body, message) for callback in callbacks]
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/kombu/messaging.py", line 624, in _receive_callback
return on_m(message) if on_m else self.receive(decoded, message)
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/amqp/channel.py", line 1615, in _on_basic_deliver
fun(msg)
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/amqp/abstract_channel.py", line 145, in dispatch_method
listener(*args)
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/amqp/connection.py", line 518, in on_inbound_method
method_sig, payload, content,
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/amqp/method_framing.py", line 79, in on_frame
callback(channel, msg.frame_method, msg.frame_args, msg)
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/amqp/connection.py", line 511, in blocking_read
return self.on_inbound_frame(frame)
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/amqp/connection.py", line 505, in drain_events
while not self.blocking_read(timeout):
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/kombu/transport/pyamqp.py", line 103, in drain_events
return connection.drain_events(**kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/kombu/connection.py", line 323, in drain_events
return self.transport.drain_events(self.connection, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/kombu/mixins.py", line 197, in consume
conn.drain_events(timeout=safety_interval)
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/kombu/mixins.py", line 175, in run
for _ in self.consume(limit=None, **kwargs):
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/dispatch/worker/base.py", line 111, in run
super(AWXConsumer, self).run(*args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/management/commands/run_dispatcher.py", line 88, in handle
consumer.run()
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/core/management/base.py", line 364, in execute
output = self.handle(*args, **options)
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/core/management/base.py", line 323, in run_from_argv
self.execute(*args, **cmd_options)
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/core/management/__init__.py", line 375, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/core/management/__init__.py", line 381, in execute_from_command_line
utility.execute()
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/__init__.py", line 152, in manage
execute_from_command_line(sys.argv)
(frame information optimized out)
@xTrekStorex I'm having a lot of trouble reproducing this in a reliable fashion that allows me to observe and experiment with this issue, but it's a high priority to resolve it, and I have a hunch about what it might be related to in recent commit history.
Are you willing to apply a patch and run it for a day or so to see if your hanging issues go away?
If so, give this a whirl:
https://github.com/ansible/awx/pull/6093
cc @shanemcd @chrismeyersfsu @AlanCoding
@xTrekStorex I'm having a lot of trouble reproducing this in a reliable fashion that allows me to observe and experiment with this issue, but it's a high priority to resolve it, and I have a hunch about what it might be related to in recent commit history.
Are you willing to apply a patch and run it for a day or so to see if your hanging issues go away?
[...]
Done. Will get back to you asap
For posterity, I very much suspect this is actually related to a bug in cpython itself (https://bugs.python.org/issue38884) that we've begun encountering due to this recent PR of mine:
https://github.com/ansible/awx/pull/5837
...which moved our periodic heartbeat code (when we removed celery a few weeks ago) to a thread (instead of a distinct process). Details:
https://github.com/ansible/awx/pull/6093#issuecomment-591665229
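For anyone curious about the shape of that CPython issue: fork() copies only the calling thread, so any lock held by another thread (such as the import lock being acquired in the traceback above) stays locked forever in the forked child. The snippet below is a minimal, self-contained sketch of that general hazard on Linux, using a plain threading.Lock as a stand-in for the import lock; it illustrates the failure mode and is not AWX code.
# Minimal illustration of the fork-while-another-thread-holds-a-lock hazard (Linux).
import os
import sys
import threading
import time

lock = threading.Lock()
held = threading.Event()

def holder():
    lock.acquire()        # background thread takes the lock...
    held.set()
    time.sleep(600)       # ...and keeps it (stands in for a slow import/publish)
    lock.release()

threading.Thread(target=holder, daemon=True).start()
held.wait()

pid = os.fork()           # fork while the other thread still holds the lock
if pid == 0:
    # Child: the holder thread does not exist here, but the lock it held is
    # still marked locked, so this acquire() never returns -- a deadlock.
    print("child: waiting on inherited lock (this hangs)...", flush=True)
    lock.acquire()
    sys.exit(0)
else:
    time.sleep(2)
    print("parent: child is stuck on the inherited lock; killing it")
    os.kill(pid, 9)
    os.waitpid(pid, 0)
Moving the periodic work back out of a thread (or avoiding imports/publishes around fork) sidesteps exactly this class of deadlock, which matches the fix in the PR above.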
@xTrekStorex I've done some testing on my end and am feeling good about my PR. If I hear back from you in the coming days (that you're unable to reproduce with my patch), I'll probably merge https://github.com/ansible/awx/pull/6093 and incorporate it into the next AWX release.
@xTrekStorex I've done some testing on my end and am feeling good about my PR. If I hear back from you in the coming days (that you're unable to reproduce with my patch), I'll probably merge #6093 and incorporate it into the next AWX release.
I don't want to jinx it but in the past 24 hours I have not encountered the issue again.
Everything ran and keeps running, schedules work - no complaints.
I don't intend to switch back to the official Docker image anytime soon so I'll just leave your PR running.
What I noticed, though, is that workers seem to be recreated a lot more often now.
When I still encountered the issue, the sent and finished values per worker were in the 130-350 range. Now they are more like 10-90, and I didn't restart the task container (24h uptime).
I also just noticed the PIDs increased again, so they are definitely new workers.
@xTrekStorex do you have something on a schedule that's running over, say, the weekend?
With the old 9.2.0 image, were you encountering the issue often (more than, say, once per day)?
I'm not ready to call this solved, but if you're convinced with a before/after, I'm likely to merge my PR early next week.
@xTrekStorex do you have something on a schedule that's running over, say, the weekend?
With the old 9.2.0 image, were you encountering the issue often (more than, say, once per day)?
I'm not ready to call this solved, but if you're convinced with a before/after, I'm likely to merge my PR early next week.
I encountered the issue every 2-6 hours.
There are no major changes planned for the weekend; however, my workflow runs continuously 24/7, every 20 minutes with a runtime of ~18 minutes, to enforce configuration state (I moved to AWX from the Puppet world - it's a habit). So it will create and run plenty of jobs, if that is what you're asking.
@xTrekStorex thanks for helping us test this, I'll check back in with you next week 👍
Hello, after Friday's update my AWX is hung again.
(InteractiveConsole)
def check_lock():
...     import time
...     from awx.main.utils.pglock import advisory_lock
...     time.sleep(1)
...     with advisory_lock('task_manager_lock', wait=False) as acquired:
...         return acquired
...
[check_lock() for i in range(20)]
[True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True]
awx-manage run_dispatcher --status
strace: attach: ptrace(PTRACE_ATTACH, 35212): Operation not permitted
@blaisemichel,
I've got a PR which I suspect will resolve this issue:
https://github.com/ansible/awx/pull/6093
@xTrekStorex,
Any updates on your end after a weekend of running periodic jobs?
[...]
@xTrekStorex, any updates on your end after a weekend of running periodic jobs?
All good. Not a single hung job.
All, is there a way to monitor the job stuck issue and restart the stuck task container?
You can monitor it with the API:
https://"tower host URL"/api/v2/unified_jobs/?status=pending
Not sure how to find out which pod has the stuck container, though.
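For illustration, here is a minimal polling sketch against that endpoint. The hostname, token, and threshold are placeholders, and the "created" timestamp field is assumed from the standard AWX job serializer; restarting the container itself is left to whatever tooling you already use (docker restart, an alert handler, etc.).
# Hypothetical monitor: flag jobs that have been "pending" longer than a threshold.
from datetime import datetime, timezone
import requests

AWX_URL = "https://tower.example.com"   # placeholder host
TOKEN = "REPLACE_WITH_API_TOKEN"        # placeholder OAuth2 token
STUCK_AFTER_SECONDS = 30 * 60           # consider a job stuck after 30 minutes pending

def stuck_pending_jobs():
    resp = requests.get(
        f"{AWX_URL}/api/v2/unified_jobs/",
        params={"status": "pending"},
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    now = datetime.now(timezone.utc)
    stuck = []
    for job in resp.json().get("results", []):
        created = datetime.fromisoformat(job["created"].replace("Z", "+00:00"))
        if (now - created).total_seconds() > STUCK_AFTER_SECONDS:
            stuck.append((job["id"], job.get("name", "")))
    return stuck

if __name__ == "__main__":
    for job_id, name in stuck_pending_jobs():
        print(f"stuck pending job {job_id}: {name}")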
All, is there a way to monitor the job stuck issue and restart the stuck task container?
@fischerdr Instead of an alert-handler-based workaround, you could either apply the above-mentioned patch manually like I did, or downgrade to AWX 9.1.1 (if that's an option for you) until the fixed release becomes available.
My recommendation would be what @xTrekStorex said - either apply the patch directly (it was just merged into devel a short while ago), or wait for the next release of AWX (I'm planning to cut a new minor release at some point in the next week with this fix).
I don't suspect that detecting and working around this bug is worth the effort, personally.
So 9.1.1 is OK to deploy until the new version is out?
@fischerdr to the best of my knowledge, this was a regression introduced in 9.2.0, and I've heard several people report that 9.1.1 doesn't suffer from this bug.
I have the same exact issue.
What I did to patch my container:
- Create ./patches/9.2.0 and add the files from https://github.com/ansible/awx/pull/6093/files
- Mount it using the docker-compose volumes syntax
- Override the command to hook my own script before launching the service:
# Patch awx 9.2.0
awx_version="$(awx-manage --version)"
echo "current awx version is: ${awx_version}"
if [[ "${awx_version}" == "9.2.0" ]]; then
    echo "Patching awx because current version == 9.2.0"
    # patch
    cp -f /patches/9.2.0/awx/main/dispatch/periodic.py /var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/dispatch/
    cp -f /patches/9.2.0/awx/main/management/commands/run_dispatcher.py /var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/management/commands/
fi
I just patched today; I'll see how it holds up over time.
Edit: Since I patched, I haven't had the problem anymore.
Just a little more feedback:
My awx_task container was recreated last night, and the issue returned shortly after. Patching the (new) container again fixes it - again.
I could build a fixed image or modify the image command like @kakawait did to patch it persistently, but since a fixed release is coming, I just do the following (in case someone just wants to try the patch without any extra volumes or command overrides):
docker exec -it awx_task bash
yum install -y wget
wget https://raw.githubusercontent.com/ryanpetrello/awx/5364e7839724026ad4ae49120c4b38dd8ab24326/awx/main/management/commands/run_dispatcher.py -O /var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/management/commands/run_dispatcher.py
wget https://raw.githubusercontent.com/ryanpetrello/awx/5364e7839724026ad4ae49120c4b38dd8ab24326/awx/main/dispatch/periodic.py -O /var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/dispatch/periodic.py
exit
docker kill awx_task
docker start awx_task
Any plans yet on a date for the next release that will include this fix?
Hey @tgates81 (and others),
I'm planning to release a new version of AWX tomorrow that will address this issue:
As we were reproducing this on a regular basis on internal jobs, and many upstream users were as well, I am going to call this verified by the fact that it has been solved for all parties. Closing.