Awx: Replace clustered RabbitMQ with something simpler

Created on 4 Dec 2019 · 17 Comments · Source: ansible/awx

ISSUE TYPE
  • Feature Idea
SUMMARY

Replace our clustered implementation of RabbitMQ with something that is easier to understand and operate (and that matches AWX's needs better).

AWX currently makes extensive use of clustered RabbitMQ:

  1. As a form of direct topic-based RPC for dispatching jobs (e.g., playbook runs) to underlying AWX instances. This process involves a periodic scheduler that wakes up, finds work to do, picks an available node with capacity, and places a message on its queue, which is treated as a sort of per-instance "task queue" à la https://python-rq.org or https://docs.celeryproject.org/en/stable/. Certain special messages (generally used to perform internal housekeeping tasks in AWX) are "broadcast" to all nodes instead of following a direct RPC topology.

  2. As a buffer for processing job output (Ansible callback events/stdout) via AWX's "callback receiver" process running on each AWX instance.

  3. As a backend for AWX's websocket support for in-browser streaming stdout and live job status updates. Our websocket implementation is based on a custom AMQP-specific ASGI backend which we wrote and maintain, https://github.com/ansible/asgi_amqp/. As time has marched on, and the upstream channels library has drastically changed its architecture in anticipation of native async support in python3, it has become an increased maintenance burden for us to continue to support a custom backend specific to AMQP (especially when it appears that pretty much everybody upstream that uses Channels is just using Redis).

When we originally designed this system years ago, we optimized as heavily as possible for data integrity and safety. But in the scenarios described above, the data we manage under this system is largely ephemeral. In the most extreme cases, it doesn't persist beyond the lifetime of a running playbook. In other words, if a node running a playbook were to suddenly go offline, we can't really recover from that sort of scenario anyways without re-running the playbook. Similarly, if messages are lost in flight in _rare_ circumstances, you can always just relaunch a playbook.

We're paying a heavy cost for this cluster-wide data mirroring/replication. Historically, we've heard from many of our users that:

  • RabbitMQ clustering doesn't work well unless cluster peers have very low latency between them. In fact, this is a limitation called out repeatedly in RabbitMQ's clustering documentation. It's an aspect of RabbitMQ clustering that we knew about when we chose it years ago, but it's turned out to be much more painful than we anticipated.

  • Especially in environments with unreliable networks, RabbitMQ can be very difficult to administer and troubleshoot. In particular, we regularly have users who report network partitioning scenarios that require manual intervention via Erlang- and/or RabbitMQ-specific remediation.

    • When cluster nodes disappear for prolonged periods of time (hours, days), we've seen many situations where RabbitMQ clustering just isn't able to recover on its own, which causes a myriad of issues when the node returns. Detecting and remediating this often leads to service outages.

    • The firewall/security group requirements for inter-node replication are a common source of confusion for users, and failing to configure them properly can result in situations where adding a node to an existing cluster fails and results in an unanticipated cluster-wide outage.

What we've come to realize is that this architecture is likely not worth the operational and architectural cost we're paying.

Long-term, we'd prefer to move to a model that does not require a control plane that relies on a clustered message bus, but instead one where members of the control plane can largely drop off with minimal effect beyond lowered total execution capacity. RabbitMQ clustering is explicitly not reliable across AZs, and especially not across regions, and while the newer topologies we're considering don't absolve us of this entirely, our goal is to move AWX to a model that is much more forgiving of high-latency networks in general.

In the next major version of AWX, we'd like to investigate replacing RabbitMQ with a combination of features provided by Redis (a new dependency) and Postgres itself. This would most likely look something like this:

  • Dispatching tasks is still treated as “direct RPC”. In other words, when the task manager runs, it picks one cluster node with capacity, and assigns it as the “execution node”. Dispatcher processes running on every node listen for “tasks” via PostgreSQL channel notification support (https://www.postgresql.org/docs/10/sql-notify.html)

  • Events emitted from playbooks are no longer sent to a distributed message queue (previously RabbitMQ), but instead a local redis running on each node. Callback receivers on each node listen for events on that node and persist them into the database.

  • When an event is persisted to the database by the callback receiver, it is also broadcast to all cluster peers via ASGI. In this way, if a playbook runs on Node A, users connected to Daphne on Nodes B, C, and D will receive a broadcast of these events and see the output in their browser tabs.
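
To make the first bullet a bit more concrete, here is a minimal sketch of what "direct RPC" over PostgreSQL notifications could look like. This is not AWX's actual dispatcher code: the function names, the node dictionary keys (name, capacity, consumed), and the JSON payload shape are all assumptions made for illustration. The only real pieces are PostgreSQL's pg_notify(channel, payload) function and the poll()/notifies interface that psycopg2 connections expose after a LISTEN.

```python
import json


def pick_execution_node(nodes):
    """Return the name of the node with the most free capacity, or None.

    `nodes` is a list of dicts with hypothetical keys: name, capacity, consumed.
    """
    free = {n["name"]: n["capacity"] - n["consumed"] for n in nodes}
    candidates = [name for name, remaining in free.items() if remaining > 0]
    if not candidates:
        return None
    return max(candidates, key=free.get)


def dispatch(cursor, channel, task, args):
    """Publish a task on a per-node channel via PostgreSQL's pg_notify().

    `cursor` is any DB-API cursor using %s placeholders (psycopg2, for example).
    The payload format is an assumption, not AWX's wire format.
    """
    payload = json.dumps({"task": task, "args": args})
    cursor.execute("SELECT pg_notify(%s, %s)", (channel, payload))
    return payload


def drain_notifications(conn):
    """Yield decoded tasks from a psycopg2-style connection.

    Assumes the connection has already issued LISTEN on its own channel.
    """
    conn.poll()
    while conn.notifies:
        note = conn.notifies.pop(0)
        yield json.loads(note.payload)
```

In a layout like this each node would LISTEN on a channel named after itself, so a notification only wakes the dispatcher it was addressed to. Keep in mind that NOTIFY payloads are limited in size (under 8000 bytes in a default PostgreSQL build), so a real implementation would likely send small messages that reference rows in the database rather than large blobs.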

Longer term, introducing Redis would potentially allow us to also lose our dependence on memcached (so in other words, we might be able to swap out two dependencies, and replace them with one single new dependency).

api installer high enhancement

All 17 comments

some additional work items under this:

  • [ ] include an awx-manage based health check for the redis system
  • [ ] include ^ health check in the sos report as well as find way to depend on/enable the redis sos report https://github.com/ansible/awx/blob/devel/tools/sosreport/tower.py so we can get redis logs in sos report
  • [ ] send one final unsubscribed message back to ws client when tower ACKs the unsubscribe request so we can know when we have actually been unsubscribed

cc @MrMEEE in case you haven't seen this yet

also: https://groups.google.com/forum/#!topic/awx-project/lRnm2vB1oEQ

@ryanpetrello Thanks for the heads-up.. I will follow this closely :)

@MrMEEE the biggest change is "install and configure Redis, not RabbitMQ". Also, we lost a number of RabbitMQ toggle-ables in the installer.

You may be interested in any changes under ./installer in the PR:

https://github.com/ansible/awx/pull/6034/files#diff-bfa9126dc8059138bf7554d741cb6a5d
https://github.com/ansible/awx/pull/6034/files#diff-fabe539e09ace3de67486bba9b5b3be6
https://github.com/ansible/awx/pull/6034/files#diff-0091f8a83b63dafea8313c794ba726b3

Extensive testing was done before merge to ensure the installation was working as expected and that replacing RabbitMQ with Redis would not introduce regressions.

With that said, we can consider this verified, and any further polishing will be handled in separate issues (some of which are already open).

Just a heads up @MrMEEE - 10.0.0 is out now, and includes this change.

@ryanpetrello thanks.. A completely new build platform, CentOS8/RHEL8 support and the Redis changes are in the works.. I hope for a release after easter

Hey there, I installed AWX on Kubernetes after Redis was introduced. The installation completed with no issues, but when I access the UI and try to do anything, I get errors related to the API. I am attaching a couple of screenshots of the errors I am getting.
Screen Shot 2020-03-31 at 3 56 59 AM
Screen Shot 2020-03-31 at 3 58 48 AM

I have installed version 10.0.0 of AWX and that has resolved the above issue. However, the AWX UI no longer appears to refresh automatically. For example, after starting a job, the job stays at pending until the browser is manually refreshed.

@aak1989 you've got HTTP 500 errors - can you share any errors you might see in the awx_web logs?

Also, could you file a _new_ issue describing what you're encountering? Thanks.

Hi, I got the same error while upgrading Ansible Tower from 7.0.0 to 11.0.0 using a Docker Compose file. When I run the docker-compose up command, I get the error below. Please help me find the mistake in my configuration.

ValueError: Redis URL must specify one of the followingschemes (redis://, rediss://, unix://)
task_1 | 2020-05-09 11:00:41,341 INFO exited: callback-receiver (exit status 1; not expected)
task_1 | 2020-05-09 11:00:42,344 INFO spawned: 'callback-receiver' with pid 1276
task_1 | 2020-05-09 11:00:43,345 INFO success: callback-receiver entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
web_1 | 2020-05-09 11:00:43,549 WARNING awx.main.analytics.broadcast_websocket ('Unsupported URI scheme', 'amqp')
task_1 | 2020-05-09 11:00:43,977 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1282
task_1 | 2020-05-09 11:00:43,977 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1282
task_1 | 2020-05-09 11:00:43,982 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1283
task_1 | 2020-05-09 11:00:43,982 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1283
task_1 | 2020-05-09 11:00:43,986 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1284
task_1 | 2020-05-09 11:00:43,986 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1284
task_1 | 2020-05-09 11:00:43,991 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1285
task_1 | 2020-05-09 11:00:43,991 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1285
task_1 | Traceback (most recent call last):

below is my compose file:
version: '2'
services:

  web:
    image: ansible/awx_web:11.0.0
    container_name: awx_web
    depends_on:
      - redis
      - memcached
    ports:
      - "80:8052"
      - "443:8443"
    hostname: awxweb
    user: root
    restart: unless-stopped
    volumes:
      - "/var/lib/awx/SECRET_KEY:/etc/tower/SECRET_KEY"
      - "/var/lib/awx/environment.sh:/etc/tower/conf.d/environment.sh"
      - "/var/lib/awx/credentials.py:/etc/tower/conf.d/credentials.py"
      - "/var/lib/awx/projects:/var/lib/awx/projects:rw"
      - "/var/lib/awx/projects/nginx.conf:/etc/nginx/nginx.conf:rw"
    dns:
      - 10.204.226.77
      - 10.204.226.111
    environment:
      http_proxy:
      https_proxy:
      no_proxy:

  task:
    image: ansible/awx_task:11.0.0
    container_name: awx_task
    depends_on:
      - redis
      - memcached
      - web
    hostname: awx
    user: root
    restart: unless-stopped
    volumes:
      - "/var/lib/awx/SECRET_KEY:/etc/tower/SECRET_KEY"
      - "/var/lib/awx/environment.sh:/etc/tower/conf.d/environment.sh"
      - "/var/lib/awx/credentials.py:/etc/tower/conf.d/credentials.py"
      - "/var/lib/awx/projects:/var/lib/awx/projects:rw"
    dns:
      - 10.204.226.77
      - 10.204.226.111
    environment:
      http_proxy:
      https_proxy:
      no_proxy:

  redis:
    image: redis:6.0-rc4-alpine3.11
    container_name: tools_redis_1
    environment:
      REDIS_PASSWORD: password
    ports:
      - "6379:6379"
    volumes:
      - "/var/lib/awx/redis.conf:/usr/local/etc/redis/redis.conf"
      - "/var/lib/awx/redis_socket_standalone:/var/run/redis/"
    command: ["/usr/local/etc/redis/redis.conf"]

  memcached:
    image: "memcached:alpine"
    container_name: awx_memcached
    restart: unless-stopped
    environment:
      http_proxy:
      https_proxy:
      no_proxy:

And in the environment.sh file I have the configuration below:
REDIS_URL=redis://ansible-ro.rbkm0e.ng.0002.use1.cache.amazonaws.com:6379
REDIS_PORT=6379
REDIS_SOCKET=/var/lib/awx/redis.sock
REDIS_PASSWORD=password

@rkatta22 making comments on a closed ticket is not going to receive a reply. You need to file a _new_ issue if you think you've encountered a bug.

@rkatta22 you haven't encountered a bug - you just have old configuration of some sort lying around pointed at an old AMQP connection string from a prior install (which is no longer valid):

('Unsupported URI scheme', 'amqp')
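
For anyone hitting the same thing after an upgrade, a quick way to spot a stale broker URL before starting the containers might look like the sketch below. It is not part of AWX; the function name check_broker_url is made up for this example, and the accepted schemes are simply the three listed in the ValueError quoted earlier in this thread.

```python
from urllib.parse import urlparse

# Schemes listed in the ValueError quoted in this thread.
REDIS_SCHEMES = ("redis", "rediss", "unix")


def check_broker_url(url):
    """Return None if the URL scheme looks Redis-compatible, else a message.

    Hypothetical helper for illustration; it only inspects the scheme and
    does not try to connect to anything.
    """
    scheme = urlparse(url).scheme
    if scheme in REDIS_SCHEMES:
        return None
    return "unsupported scheme %r; expected one of: %s" % (
        scheme,
        ", ".join(REDIS_SCHEMES),
    )
```

Passing it an old-style connection string such as amqp://guest:awxpass@redis:5672/awx returns a message naming the amqp scheme, while a redis:// or unix:// URL returns None.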

If you need help troubleshooting an AWX install, try our mailing list or IRC channel:

http://webchat.freenode.net/?channels=ansible-awx
https://groups.google.com/forum/#!forum/awx-project

Hi team, I am working on an Ansible Tower upgrade from 7 to 10. I can see the containers are up, but in the awx web container log I see the error below. Could someone please help me?

task_1 | Traceback (most recent call last):
task_1 | File "/usr/bin/awx-manage", line 8, in <module>
task_1 | sys.exit(manage())
task_1 | File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/__init__.py", line 152, in manage
task_1 | execute_from_command_line(sys.argv)
task_1 | File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/management/__init__.py", line 381, in execute_from_command_line
task_1 | utility.execute()
task_1 | File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/management/__init__.py", line 375, in execute
task_1 | self.fetch_command(subcommand).run_from_argv(self.argv)
task_1 | File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/management/base.py", line 323, in run_from_argv
task_1 | self.execute(*args, **cmd_options)
task_1 | File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/management/base.py", line 364, in execute
task_1 | output = self.handle(*args, **options)
task_1 | File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/management/commands/run_callback_receiver.py", line 26, in handle
task_1 | consumer.run()
task_1 | File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/dispatch/worker/base.py", line 119, in run
task_1 | queue = redis.Redis.from_url(settings.BROKER_URL)
task_1 | File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/redis/client.py", line 673, in from_url
task_1 | connection_pool = ConnectionPool.from_url(url, db=db, **kwargs)
task_1 | File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/redis/connection.py", line 1046, in from_url
task_1 | 'schemes (%s)' % valid_schemes)
task_1 | ValueError: Redis URL must specify one of the followingschemes (redis://, rediss://, unix://)
task_1 | 2020-05-13 13:01:44,167 INFO exited: callback-receiver (exit status 1; not expected)
task_1 | 2020-05-13 13:01:45,169 INFO spawned: 'callback-receiver' with pid 907
task_1 | 2020-05-13 13:01:46,171 INFO success: callback-receiver entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)

web_1 | 2020-05-13 13:01:46,669 WARNING awx.main.analytics.broadcast_websocket ('Unsupported URI scheme', 'amqp')

----below is my docker-compose file---->
version: '2'
services:

  web:
    image: ansible/awx_web:10.0.0
    container_name: awx_web
    depends_on:
      - redis
      - memcached
    ports:
      - "80:8052"
      - "443:8443"
    hostname: awxweb
    user: root
    restart: unless-stopped
    volumes:
      - "/var/lib/awx/SECRET_KEY:/etc/tower/SECRET_KEY"
      - "/var/lib/awx/environment.sh:/etc/tower/conf.d/environment.sh"
      - "/var/lib/awx/credentials.py:/etc/tower/conf.d/credentials.py"
      - "/var/lib/awx/projects:/var/lib/awx/projects:rw"
      - "/var/lib/awx/projects/nginx.conf:/etc/nginx/nginx.conf:rw"
    dns:
      - 10.204.226.77
      - 10.204.226.111
    environment:
      http_proxy:
      https_proxy:
      no_proxy:

  task:
    image: ansible/awx_task:10.0.0
    container_name: awx_task
    depends_on:
      - redis
      - memcached
      - web
    hostname: awx
    user: root
    restart: unless-stopped
    volumes:
      - "/var/lib/awx/SECRET_KEY:/etc/tower/SECRET_KEY"
      - "/var/lib/awx/environment.sh:/etc/tower/conf.d/environment.sh"
      - "/var/lib/awx/credentials.py:/etc/tower/conf.d/credentials.py"
      - "/var/lib/awx/projects:/var/lib/awx/projects:rw"
    dns:
      - 10.204.226.77
      - 10.204.226.111
    environment:
      http_proxy:
      https_proxy:
      no_proxy:

  redis:
    image: redis:6.0-rc4-alpine3.11
    container_name: tools_redis_1
    environment:
      REDIS_PASSWORD: password
    ports:
      - "6379:6379"
    volumes:
      - "/var/lib/awx/redis.conf:/usr/local/etc/redis/redis.conf"
      - "/var/lib/awx/redis_socket_standalone:/var/run/redis/"
      - "/var/lib/awx/environment.sh:/etc/tower/conf.d/environment.sh"
      - "/var/lib/awx/credentials.py:/etc/tower/conf.d/credentials.py"
    command: ["/usr/local/etc/redis/redis.conf"]

  memcached:
    image: "memcached:alpine"
    container_name: awx_memcached
    restart: unless-stopped
    environment:
      http_proxy:
      https_proxy:
      no_proxy:

below is my credentials.py file configuration---->
DATABASES = {
    'default': {
        'ATOMIC_REQUESTS': True,
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': "awx",
        'USER': "awx",
        'PASSWORD': "awxpass1",
        'HOST': "awx-tower-upgrade.cnectdraqndy.us-east-1.rds.amazonaws.com",
        'PORT': "5432",
    }
}

BROKER_URL = 'amqp://{}:{}@{}:{}/{}'.format(
    "guest",
    "awxpass",
    "redis",
    "5672",
    "awx")

CHANNEL_LAYERS = {
    'default': {'BACKEND': 'asgi_amqp.AMQPChannelLayer',
                'ROUTING': 'awx.main.routing.channel_routing',
                'CONFIG': {'url': BROKER_URL}}
}

CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
        'LOCATION': '{}:{}'.format("memcached", "11211")
    },
    'ephemeral': {
        'BACKEND': 'django.core.cache.backends.locmem.LocMemCache',
    },
}

below is my environment.sh file --->
DATABASE_USER=awx
DATABASE_NAME=awx
DATABASE_HOST=awx-tower-upgrade.cnectdraqndy.us-east-1.rds.amazonaws.com
DATABASE_PORT=5432
DATABASE_PASSWORD=awxpass1
MEMCACHED_HOST=memcached
MEMCACHED_PORT=11211
RABBITMQ_HOST=rabbitmq
RABBITMQ_PORT=5672
AWX_ADMIN_USER=admin
AWX_ADMIN_PASSWORD=password

ANSIBLE_REDIS_HOST=ansible-tower.rbkm0e.ng.0001.use1.cache.amazonaws.com:6379

REDIS_URL="redis://ansible-tower-ro.rbkm0e.ng.0001.use1.cache.amazonaws.com:6379"
REDIS_PORT=6379
REDIS_SOCKET=/var/lib/awx/redis.sock
REDIS_PASSWORD=password

@rkatta22,

If you need help troubleshooting an AWX install, try our mailing list or IRC channel:

http://webchat.freenode.net/?channels=ansible-awx
https://groups.google.com/forum/#!forum/awx-project
