Awx: Error with rejoining node to cluster after lost connection to postgres

Created on 9 Jul 2019 · 8 Comments · Source: ansible/awx

ISSUE TYPE

Bug Report

SUMMARY

A node can't rejoin the cluster after losing its network connection to the Postgres database.

ENVIRONMENT

AWX version: 3.0.1
The bug is still present in 6.0.0.

STEPS TO REPRODUCE

1) Set up an HA AWX cluster with 3+ nodes and an external Postgres database
2) Shut down the network on one of the nodes and wait 120 sec (grace_period)
3) Look at the logs of the working nodes; you will see that the node without network was removed from the cluster:

Jul  4 17:16:59 1.awx.node.dc2 dispatcher[207]: 2019-07-04 14:16:59,220 INFO     awx.main.tasks Host 1.awx.node.dc1 Automatically Deprovisioned.

And from postgres database:

awx=> SELECT id FROM main_instance;
 id
-----
 543
 550
(2 rows)

4) Fix the network on the node and look at its logs:

Jul  4 17:20:39 1.awx.node.dc1 dispatcher[505]: 2019-07-04 14:20:39,433 ERROR    awx.main.dispatch failed to write inbound message
Jul  4 17:20:39 1.awx.node.dc1 dispatcher[505]: Traceback (most recent call last):
Jul  4 17:20:39 1.awx.node.dc1 dispatcher[505]: File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/dispatch/pool.py", line 388, in write
Jul  4 17:20:39 1.awx.node.dc1 dispatcher[505]: self.cleanup()
Jul  4 17:20:39 1.awx.node.dc1 dispatcher[505]: File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/dispatch/pool.py", line 373, in cleanup
Jul  4 17:20:39 1.awx.node.dc1 dispatcher[505]: reaper.reap(excluded_uuids=running_uuids)
Jul  4 17:20:39 1.awx.node.dc1 dispatcher[505]: File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/dispatch/reaper.py", line 35, in reap
Jul  4 17:20:39 1.awx.node.dc1 dispatcher[505]: me = instance or Instance.objects.me()
Jul  4 17:20:39 1.awx.node.dc1 dispatcher[505]: File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/managers.py", line 88, in me
Jul  4 17:20:39 1.awx.node.dc1 dispatcher[505]: raise RuntimeError("No instance found with the current cluster host id")
Jul  4 17:20:39 1.awx.node.dc1 dispatcher[505]: RuntimeError: No instance found with the current cluster host id

EXPECTED RESULTS

After the network connection to the database is restored, the node successfully rejoins the cluster.

ACTUAL RESULTS

The node never rejoins the cluster unless the instance is restarted.

ADDITIONAL INFORMATION

The node can't return to the cluster because it calls the cleanup function from pool.py on each heartbeat.
cleanup calls reaper.reap(), which fails because it can't get the instance id (AWX deleted the node from the database at reproduction step 3):

me = instance or Instance.objects.me()

I created a pull request with a probable fix: https://github.com/ansible/awx/pull/4268
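
To illustrate the failure mode described above, here is a minimal, self-contained sketch of a cleanup step that skips reaping when the node's instance record is missing. It is not the code from the pull request and not AWX's actual implementation: FakeInstanceManager and safe_cleanup are hypothetical names, and the only behaviour taken from the issue is that me() raises RuntimeError when the record is gone.

```python
import logging

logger = logging.getLogger("sketch.dispatch")


class FakeInstanceManager:
    """Hypothetical stand-in for Instance.objects; not the real AWX manager."""

    def __init__(self, registered, current_host):
        self.registered = set(registered)
        self.current_host = current_host

    def me(self):
        # Mirrors the behaviour seen in the traceback: raise when the row is gone.
        if self.current_host not in self.registered:
            raise RuntimeError("No instance found with the current cluster host id")
        return self.current_host


def safe_cleanup(manager, reap):
    """Call reap(instance) only if this node still has an instance record.

    Returns True when the reap ran and False when it was skipped.
    """
    try:
        me = manager.me()
    except RuntimeError:
        logger.warning("instance record missing; skipping reap on this heartbeat")
        return False
    reap(me)
    return True
```

With a guard like this, a deprovisioned node would log a warning on each heartbeat instead of failing inside cleanup, which leaves room for some other mechanism to register it again.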

api medium bug

byumov

👍1

All 8 comments

The problem with reap is that the function requires the instance it is running on to be registered.

https://github.com/ansible/awx/blob/2aa32f61f899836de8b8f502caeafbb8c030aed9/awx/main/dispatch/reaper.py#L32-L36

But there may be cases when the instance, for some reason, is automatically deprovisioned. If that happens then the instance can't be provisioned back.

I found two places where reap breaks automatic node provisioning.

https://github.com/ansible/awx/blob/2aa32f61f899836de8b8f502caeafbb8c030aed9/awx/main/dispatch/pool.py#L377

We don't need to call cleanup if the instance is not in the cluster.
https://github.com/ansible/awx/blob/2aa32f61f899836de8b8f502caeafbb8c030aed9/awx/main/dispatch/pool.py#L390-L392

The second place where reap may be called when the instance is not in the cluster (in the case of automatic deprovisioning) is here:
https://github.com/ansible/awx/blob/2aa32f61f899836de8b8f502caeafbb8c030aed9/awx/main/management/commands/run_dispatcher.py#L123

So if you try to restart the dispatcher, it will not start.

YuriGrigorov on 9 Jul 2019

👍1

@byumov This is happening to me on a single-node setup as well:

2019-07-17 08:47:27,611 INFO success: dispatcher entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
Traceback (most recent call last):
  File "/usr/bin/awx-manage", line 11, in <module>
    load_entry_point('awx==6.0.0.0', 'console_scripts', 'awx-manage')()
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/__init__.py", line 140, in manage
    execute_from_command_line(sys.argv)
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/core/management/__init__.py", line 364, in execute_from_command_line
    utility.execute()
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/core/management/__init__.py", line 356, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/core/management/base.py", line 283, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/core/management/base.py", line 330, in execute
    output = self.handle(*args, **options)
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/management/commands/run_dispatcher.py", line 123, in handle
    reaper.reap()
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/dispatch/reaper.py", line 36, in reap
    me = instance or Instance.objects.me()
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/managers.py", line 116, in me
    raise RuntimeError("No instance found with the current cluster host id")
RuntimeError: No instance found with the current cluster host id

Task container logs are full of this. And while I can navigate the UI, I cannot run any jobs.

megakoresh on 17 Jul 2019

@megakoresh The root cause is that a backup of a Tower instance does not exclude rabbitmq.py, so a restore onto a different Ansible Tower instance brings back the original rabbitmq.py, which breaks the RabbitMQ clustering.

Use these commands to fix it:

sudo -u awx scl enable rh-python36 rh-postgresql10 "awx-manage create_preload_data"
sudo -u awx scl enable rh-python36 rh-postgresql10 "awx-manage provision_instance --hostname=$(hostname)"
sudo -u awx scl enable rh-python36 rh-postgresql10 "awx-manage register_queue --queuename=tower --hostnames=$(hostname)"

systemctl restart awx-cbreceiver
systemctl restart awx-dispatcher
systemctl restart awx-channels-worker
systemctl restart awx-daphne
systemctl restart awx-web

bjmingyang on 26 Jul 2019

@bjmingyang The root cause is hardcoded hostnames in configuration files, namely settings.py. Changing the awx_task_hostname inventory variable changes the service discovery hostname, while the settings still refer to awx. This breaks the installation. Any solution that involves poking around in a running container is not a solution at all. This must be fixed properly.

megakoresh on 26 Jul 2019

I'm having the same problem on AWX 6.1.0.
Running on OpenShift, I need to restart pods in my AWX cluster frequently because they can't rejoin the cluster by themselves.

tamirshaul on 17 Sep 2019

At its core, this issue can be condensed down to a very simple reproduction:

1) Install single-node AWX (any deployment method).
2) Once everything is running, delete the Instance from the database (Instance.objects.first().delete()).
3) Observe tracebacks like this one forever:

...
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/managers.py", line 116, in me
    raise RuntimeError("No instance found with the current cluster host id")
RuntimeError: No instance found with the current cluster host id

The practical scenario where you'll see this (as described in this issue) is in a k8s/OpenShift deployment with multiple pods (in this environment, settings.AWX_AUTO_DEPROVISION_INSTANCES = True). When a node goes missing (for any number of reasons) for (by default) 120s, the record for that node is removed from the main_instance table:

https://github.com/ansible/awx/blob/devel/awx/main/tasks.py#L444
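
The deprovisioning rule described above can be sketched roughly as follows. This is an illustration only, not the code in tasks.py: the function names and the dictionary of last-heartbeat timestamps are assumptions made for the example, and only the 120s default and the auto-deprovision switch come from this thread.

```python
from datetime import datetime, timedelta

GRACE_PERIOD = timedelta(seconds=120)  # default quoted in this issue


def find_lost_hostnames(last_heartbeats, now, grace_period=GRACE_PERIOD):
    """Return hostnames whose last heartbeat is older than the grace period."""
    return sorted(
        hostname
        for hostname, seen in last_heartbeats.items()
        if now - seen > grace_period
    )


def deprovision_lost(last_heartbeats, now, auto_deprovision=True):
    """Drop lost nodes from the mapping, mimicking removal of their rows.

    The mapping is a hypothetical stand-in for the instance table.
    """
    if not auto_deprovision:
        return []
    lost = find_lost_hostnames(last_heartbeats, now)
    for hostname in lost:
        del last_heartbeats[hostname]
    return lost
```

Note that nothing in this rule puts the record back once the node is reachable again, which is why the still-running dispatcher keeps failing.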

At a later point in time, when its connectivity is restored, the dispatcher is _still running_, and so we see the RuntimeError: No instance found with the current cluster host id error. The appropriate change here would be to update the periodic cleanup/reaping process to _detect_ a missing instance record and automatically re-perform auto-registration.
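
As a rough sketch of that idea (not the merged fix, and not AWX's API), the periodic cleanup could catch the missing-record error, register the node again, and then continue with the reap. InMemoryRegistry, register, and periodic_cleanup are hypothetical names used only for this example.

```python
class InMemoryRegistry:
    """Hypothetical stand-in for the instance table; not AWX code."""

    def __init__(self):
        self.hostnames = set()

    def me(self, hostname):
        # Same error as in the tracebacks above when the record is missing.
        if hostname not in self.hostnames:
            raise RuntimeError("No instance found with the current cluster host id")
        return hostname

    def register(self, hostname):
        self.hostnames.add(hostname)
        return hostname


def periodic_cleanup(registry, hostname, reap):
    """Reap as usual, re-registering first if the record has disappeared.

    Returns True when the node had to be registered again.
    """
    try:
        me = registry.me(hostname)
        re_registered = False
    except RuntimeError:
        me = registry.register(hostname)
        re_registered = True
    reap(me)
    return re_registered
```

On the first heartbeat after connectivity returns, this version would register the node and reap normally; later heartbeats would take the ordinary path.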

ryanpetrello on 27 Sep 2019

Thanks for fixing this @ryanpetrello, we've been running into this a lot on our Kubernetes cluster.
Any chance of baking and publishing a new awx image once your PR gets merged please?

grahamneville on 27 Sep 2019

👋 @grahamneville thank @byumov, he figured out what was up and contributed the fix.

We have a few features landing in AWX soon, and we intend to cut a new release at some point after that (which will include this fix).

ryanpetrello on 27 Sep 2019

🎉1
