A node can't rejoin the cluster after losing its network connection to the Postgres database.
1) Set up an HA AWX cluster with 3+ nodes and an external Postgres database.
2) Shut down the network on one of the nodes and wait 120 seconds (grace_period).
3) Look at the logs of the working nodes; you will see that the node without network was removed from the cluster:
Jul 4 17:16:59 1.awx.node.dc2 dispatcher[207]: 2019-07-04 14:16:59,220 INFO awx.main.tasks Host 1.awx.node.dc1 Automatically Deprovisioned.
It was also removed from the Postgres database:
awx=> SELECT id FROM main_instance;
id
-----
543
550
(2 rows)
4) Fix the network on the node and look at its logs:
Jul 4 17:20:39 1.awx.node.dc1 dispatcher[505]: 2019-07-04 14:20:39,433 ERROR awx.main.dispatch failed to write inbound message
Jul 4 17:20:39 1.awx.node.dc1 dispatcher[505]: Traceback (most recent call last):
Jul 4 17:20:39 1.awx.node.dc1 dispatcher[505]: File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/dispatch/pool.py", line 388, in write
Jul 4 17:20:39 1.awx.node.dc1 dispatcher[505]: self.cleanup()
Jul 4 17:20:39 1.awx.node.dc1 dispatcher[505]: File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/dispatch/pool.py", line 373, in cleanup
Jul 4 17:20:39 1.awx.node.dc1 dispatcher[505]: reaper.reap(excluded_uuids=running_uuids)
Jul 4 17:20:39 1.awx.node.dc1 dispatcher[505]: File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/dispatch/reaper.py", line 35, in reap
Jul 4 17:20:39 1.awx.node.dc1 dispatcher[505]: me = instance or Instance.objects.me()
Jul 4 17:20:39 1.awx.node.dc1 dispatcher[505]: File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/managers.py", line 88, in me
Jul 4 17:20:39 1.awx.node.dc1 dispatcher[505]: raise RuntimeError("No instance found with the current cluster host id")
Jul 4 17:20:39 1.awx.node.dc1 dispatcher[505]: RuntimeError: No instance found with the current cluster host id
Expected: after the network connection to the database is restored, the node successfully rejoins the cluster.
Actual: the node never rejoins the cluster unless the instance is restarted.
The node can't return to the cluster because it calls the cleanup function from pool.py on each heartbeat.
cleanup calls reaper.reap(), which fails because it can't get the instance id (AWX deleted the node from the database at reproduction step 3):
me = instance or Instance.objects.me()
I created a pull request with a probable fix: https://github.com/ansible/awx/pull/4268
The problem with reap is that the function expects a working instance.
But an instance may, for some reason, be automatically deprovisioned. If that happens, the instance can't be provisioned back.
I found two places where reap breaks automatic node provisioning.
We don't need to call cleanup if the instance is not in the cluster.
https://github.com/ansible/awx/blob/2aa32f61f899836de8b8f502caeafbb8c030aed9/awx/main/dispatch/pool.py#L390-L392
The second place where reap may be called when the instance is not in the cluster (in the case of automatic deprovisioning) is here:
https://github.com/ansible/awx/blob/2aa32f61f899836de8b8f502caeafbb8c030aed9/awx/main/management/commands/run_dispatcher.py#L123
so if you try to restart the dispatcher, it will not start.
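Both call sites share one failure mode: reap() assumes the local instance record exists. The guard idea can be sketched in a few lines of Python. This is only an illustration built on a hypothetical in-memory registry, not AWX's Django models and not necessarily what the pull request does; InstanceRegistry, reap and safe_reap here are names invented for the example.

```python
class InstanceRegistry:
    """Hypothetical stand-in for the main_instance table."""

    def __init__(self, hostnames):
        self.hostnames = set(hostnames)

    def me(self, hostname):
        # Same behaviour as in the traceback: a missing record raises.
        if hostname not in self.hostnames:
            raise RuntimeError("No instance found with the current cluster host id")
        return hostname


def reap(registry, hostname):
    """Pretend reaper: it needs the local instance record to do its work."""
    me = registry.me(hostname)
    return "reaped orphaned jobs for " + me


def safe_reap(registry, hostname):
    """Skip reaping when the local instance is not in the cluster."""
    try:
        return reap(registry, hostname)
    except RuntimeError:
        # The instance was deprovisioned; do not let this failure abort
        # the heartbeat or the dispatcher start-up.
        return None


# Only two nodes are left in the table after the deprovisioning.
registry = InstanceRegistry(["1.awx.node.dc2", "2.awx.node.dc1"])
print(safe_reap(registry, "1.awx.node.dc2"))  # registered node: reaping runs
print(safe_reap(registry, "1.awx.node.dc1"))  # deprovisioned node: skipped
```

With a guard like this, a deprovisioned node gets past the reaping step instead of failing on every heartbeat, which leaves room for it to be provisioned again.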
@byumov This is happening to me on a single-node setup as well:
2019-07-17 08:47:27,611 INFO success: dispatcher entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
Traceback (most recent call last):
File "/usr/bin/awx-manage", line 11, in <module>
load_entry_point('awx==6.0.0.0', 'console_scripts', 'awx-manage')()
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/__init__.py", line 140, in manage
execute_from_command_line(sys.argv)
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/core/management/__init__.py", line 364, in execute_from_command_line
utility.execute()
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/core/management/__init__.py", line 356, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/core/management/base.py", line 283, in run_from_argv
self.execute(*args, **cmd_options)
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/core/management/base.py", line 330, in execute
output = self.handle(*args, **options)
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/management/commands/run_dispatcher.py", line 123, in handle
reaper.reap()
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/dispatch/reaper.py", line 36, in reap
me = instance or Instance.objects.me()
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/managers.py", line 116, in me
raise RuntimeError("No instance found with the current cluster host id")
RuntimeError: No instance found with the current cluster host id
Task container logs are full of this. And while I can navigate the UI, I cannot run any jobs.
@megakoresh The root cause is that a backup of a Tower instance does not exclude rabbitmq.py, so a restore on a different Ansible Tower instance restores the original rabbitmq.py, which breaks the RabbitMQ clustering.
Use these commands to fix it:
sudo -u awx scl enable rh-python36 rh-postgresql10 "awx-manage create_preload_data"
sudo -u awx scl enable rh-python36 rh-postgresql10 "awx-manage provision_instance --hostname=$(hostname)"
sudo -u awx scl enable rh-python36 rh-postgresql10 "awx-manage register_queue --queuename=tower --hostnames=$(hostname)"
systemctl restart awx-cbreceiver
systemctl restart awx-dispatcher
systemctl restart awx-channels-worker
systemctl restart awx-daphne
systemctl restart awx-web
@bjmingyang The root cause is hardcoded hostnames in configuration files, namely settings.py. Changing the awx_task_hostname inventory variable changes the service discovery hostname, while the settings still refer to awx. This breaks the installation. And any solution that involves poking around in a running container is not a solution at all. This must be fixed properly.
I'm having the same problem on AWX 6.1.0.
Running on OpenShift, I need to restart pods in my AWX cluster frequently because they can't rejoin the cluster by themselves.
At its core, this issue can be condensed down to a very simple reproduction:
Delete the Instance from the database (Instance.objects.first().delete())....
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/managers.py", line 116, in me
raise RuntimeError("No instance found with the current cluster host id")
RuntimeError: No instance found with the current cluster host id
The practical scenario where you'll see this (as described in this issue) is in a k8s/OpenShift deployment with multiple pods (in this environment, settings.AWX_AUTO_DEPROVISION_INSTANCES = True). When a node goes missing (for any number of reasons) for 120s (the default), the record for that node is removed from the main_instance table:
https://github.com/ansible/awx/blob/devel/awx/main/tasks.py#L444
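The grace-period rule can be sketched as follows. This is a minimal illustration, not AWX's implementation: the function name find_lost_instances, the last_seen mapping, and the strict "greater than" comparison are assumptions; only the 120-second default and the host names come from this thread.

```python
from datetime import datetime, timedelta

# Default grace period mentioned in this issue.
GRACE_PERIOD = timedelta(seconds=120)


def find_lost_instances(last_seen, now, grace_period=GRACE_PERIOD):
    """Return hostnames whose last heartbeat is older than the grace period.

    last_seen is a hypothetical mapping of hostname -> last heartbeat time.
    """
    return sorted(
        hostname
        for hostname, seen in last_seen.items()
        if now - seen > grace_period
    )


now = datetime(2019, 7, 4, 14, 16, 59)
last_seen = {
    "1.awx.node.dc1": now - timedelta(seconds=130),  # network was cut
    "1.awx.node.dc2": now - timedelta(seconds=10),   # healthy
}
print(find_lost_instances(last_seen, now))  # ['1.awx.node.dc1']
```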
At a later point in time, when its connectivity is restored, the dispatcher is _still running_, and so we see the RuntimeError: No instance found with the current cluster host id error. The appropriate change here would be to update the periodic cleanup/reaping process to _detect_ a missing instance record and automatically re-perform auto-registration.
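A rough sketch of that idea, assuming a plain Python set in place of the instance table: the heartbeat function, the instances set and the log list are hypothetical names for illustration, and the real change would go through AWX's own registration code rather than a set insert.

```python
def heartbeat(instances, hostname, log):
    """One periodic tick of a long-running dispatcher (illustrative only).

    instances is a set of registered hostnames standing in for the
    instance table; log collects what the tick did.
    """
    if hostname not in instances:
        # The record was auto-deprovisioned while this node was unreachable.
        # Register again instead of raising RuntimeError.
        instances.add(hostname)
        log.append("re-registered " + hostname)
    # The instance record is now guaranteed to exist, so reaping is safe.
    log.append("reaped for " + hostname)


instances = {"1.awx.node.dc2"}
log = []
heartbeat(instances, "1.awx.node.dc1", log)  # missing record: re-register, then reap
heartbeat(instances, "1.awx.node.dc1", log)  # record present: reap only
print(log)
```

The point of the design is that the check runs on every tick, so a node that reconnects after being deprovisioned recovers on its next heartbeat without a restart.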
Thanks for fixing this @ryanpetrello, we've been running into this a lot on our Kubernetes cluster.
Any chance of baking and publishing a new awx image once your PR gets merged please?
👋 @grahamneville thank @byumov, he figured out what was up and contributed the fix.
We have a few features landing in AWX soon, and we intend to cut a new release at some point after that (which will include this fix).