After an upgrade from 11.2.0 to 12.0.0 the awx_web container can no longer reach the awx_task container. The consequence is that there are plenty of errors in the UI and some things just don't work.
Upgrade from 11.2.0 to 12.0.0 by running the newer install playbook with appropriate inventory.
The upgrade succeeds and the containers are able to talk to each other.
UI not fully working with lots of GET: -1 errors.
Errors in awx_web logs:
2020-06-20 00:14:05,041 WARNING awx.main.wsbroadcast Connection from awx to awxtask failed: 'Cannot connect to host awxtask:443 ssl:False [Name or service not known]'.
2020-06-20 00:14:05,043 DEBUG awx.main.wsbroadcast Connection from awx to awxtask attempt number 10885.
Manually adding an entry in the /etc/hosts file for the awxtask container results in the following error:
2020-06-22 12:46:04,456 WARNING awx.main.wsbroadcast Connection from awx to awxtask failed: 'Cannot connect to host awxtask:443 ssl:False [Connect call failed ('172.18.0.5', 443)]'.
2020-06-22 12:46:04,458 DEBUG awx.main.wsbroadcast Connection from awx to awxtask attempt number 54201.
The task container is reachable via awx_task since that's the actual name of the container, but that's beside the point since renaming it to awxtask won't solve the problem, as seen above.
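The two warnings quoted above describe different failures: "Name or service not known" is a name-resolution error, while "Connect call failed" means the name resolved but nothing accepted the connection on that port. A minimal sketch for telling the two apart from inside the web container, using only the Python standard library. The function name `probe` and its result labels are made up for illustration and are not part of AWX:

```python
import socket


def probe(host, port, timeout=2.0):
    """Classify a TCP target as 'unresolvable', 'open', or 'closed'."""
    try:
        # Name resolution step; fails with gaierror when the name is unknown.
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "unresolvable"
    for family, socktype, proto, _canonname, addr in infos:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(addr)
                return "open"
        except OSError:
            # Refused, timed out, or unreachable; try the next address.
            continue
    return "closed"
```

Calling `probe("awxtask", 443)` before the /etc/hosts change would be expected to report a resolution failure, and a connection failure after it, matching the two log messages.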
I'm not totally sure what's up here @zigaSRC, but I'm not able to reproduce this in my vanilla 12.0.0 and 13.0.0 AWX local Docker installs. These containers definitely must be able to reach each other to broadcast stdout messages.
Can you share your inventory?
Sadly I can't share our inventories, but if there's anything you would like me to check, just let me know. Besides, I don't think it's related to that, since it's not tied to running any playbook or inventory sync, but you probably know best.
BTW I upgraded to 13.0.0 since you mentioned it, but that didn't help the issue:
Using /etc/ansible/ansible.cfg as config file
127.0.0.1 | SUCCESS => {
"ansible_facts": {
"discovered_interpreter_python": "/usr/libexec/platform-python"
},
"changed": false,
"elapsed": 0,
"match_groupdict": {},
"match_groups": [],
"path": null,
"port": 5432,
"search_regex": null,
"state": "started"
}
Using /etc/ansible/ansible.cfg as config file
127.0.0.1 | SUCCESS => {
"ansible_facts": {
uWSGI running as root, you can use --uid/--gid/--chroot options
*** WARNING: you are running uWSGI as root !!! (use the --uid flag) ***
*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI master process (pid: 83)
spawned uWSGI worker 1 (pid: 87, cores: 1)
spawned uWSGI worker 2 (pid: 88, cores: 1)
spawned uWSGI worker 3 (pid: 89, cores: 1)
spawned uWSGI worker 4 (pid: 90, cores: 1)
spawned uWSGI worker 5 (pid: 91, cores: 1)
WSGI app 0 (mountpoint='') ready in 17 seconds on interpreter 0x1df9530 pid: 87 (default app)
[pid: 87|app: 0|req: 1/1] 172.25.7.90 () {56 vars in 993 bytes} [Mon Jun 29 06:15:15 2020] GET / => generated 11460 bytes in 4402 msecs (HTTP/1.1 200) 5 headers in 169 bytes (1 switches on core 0)
WSGI app 0 (mountpoint='') ready in 21 seconds on interpreter 0x1df9530 pid: 88 (default app)
WSGI app 0 (mountpoint='') ready in 22 seconds on interpreter 0x1df9530 pid: 91 (default app)
2020-06-29 06:15:22,077 INFO daphne.cli Starting server at tcp:port=8051:interface=127.0.0.1
2020-06-29 06:15:22,077 INFO Starting server at tcp:port=8051:interface=127.0.0.1
2020-06-29 06:15:22,083 INFO daphne.server HTTP/2 support not enabled (install the http2 and tls Twisted extras)
2020-06-29 06:15:22,083 INFO HTTP/2 support not enabled (install the http2 and tls Twisted extras)
2020-06-29 06:15:22,083 INFO daphne.server Configuring endpoint tcp:port=8051:interface=127.0.0.1
2020-06-29 06:15:22,083 INFO Configuring endpoint tcp:port=8051:interface=127.0.0.1
2020-06-29 06:15:22,102 INFO daphne.server Listening on TCP address 127.0.0.1:8051
2020-06-29 06:15:22,102 INFO Listening on TCP address 127.0.0.1:8051
WSGI app 0 (mountpoint='') ready in 24 seconds on interpreter 0x1df9530 pid: 89 (default app)
WSGI app 0 (mountpoint='') ready in 25 seconds on interpreter 0x1df9530 pid: 90 (default app)
[pid: 91|app: 0|req: 1/2] 172.25.7.90 () {54 vars in 860 bytes} [Mon Jun 29 06:15:21 2020] GET /api/ => generated 21204 bytes in 2847 msecs (HTTP/1.1 200) 11 headers in 400 bytes (1 switches on core 0)
2020-06-29 06:15:25,387 WARNING awx.main.wsbroadcast Adding {'awxtask'} to websocket broadcast list
2020-06-29 06:15:25,394 DEBUG awx.main.wsbroadcast Connection from awx to awxtask attempt number 0.
2020-06-29 06:15:25,730 WARNING awx.main.wsbroadcast Connection from awx to awxtask failed: 'Cannot connect to host awxtask:443 ssl:False [Name or service not known]'.
2020-06-29 06:15:25,734 DEBUG awx.main.wsbroadcast Connection from awx to awxtask attempt number 1.
[pid: 89|app: 0|req: 1/3] 172.25.7.90 () {54 vars in 906 bytes} [Mon Jun 29 06:15:24 2020] GET /api/v2/auth/ => generated 2 bytes in 1751 msecs (HTTP/1.1 200) 10 headers in 285 bytes (1 switches on core 0)
[pid: 90|app: 0|req: 1/4] 172.25.7.90 () {56 vars in 912 bytes} [Mon Jun 29 06:15:24 2020] GET /api/ => generated 21204 bytes in 2314 msecs (HTTP/1.1 200) 11 headers in 400 bytes (1 switches on core 0)
[pid: 89|app: 0|req: 2/5] 172.25.7.90 () {56 vars in 918 bytes} [Mon Jun 29 06:15:26 2020] GET /api/v2/ => generated 1688 bytes in 570 msecs (HTTP/1.1 200) 10 headers in 288 bytes (1 switches on core 0)
2020-06-29 06:15:30,750 WARNING awx.main.wsbroadcast Connection from awx to awxtask failed: 'Cannot connect to host awxtask:443 ssl:False [Name or service not known]'.
2020-06-29 06:15:30,752 DEBUG awx.main.wsbroadcast Connection from awx to awxtask attempt number 2.
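Since the same warning repeats every few seconds, it can help to tally which target and which reason are involved. A rough sketch that parses only the wsbroadcast warning format quoted in this thread; the regular expression and the name `summarize_failures` are my own assumptions, not anything shipped with AWX:

```python
import re
from collections import Counter

# Matches the "Connection from X to Y failed" warnings quoted in this issue.
FAILED = re.compile(
    r"awx\.main\.wsbroadcast Connection from (?P<src>\S+) to (?P<dst>\S+) failed: "
    r"'Cannot connect to host (?P<host>[^:\s]+):(?P<port>\d+) ssl:\S+ "
    r"\[(?P<reason>.*)\]'"
)


def summarize_failures(log_text):
    """Count failed broadcast connections per (host, port, reason)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAILED.search(line)
        if match:
            key = (match["host"], int(match["port"]), match["reason"])
            counts[key] += 1
    return counts
```

Feeding it the container log text gives one counter entry per distinct target and reason, which makes it easy to see whether every failure points at the same hostname.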
Edit: We could talk on IRC to see if we can figure something out. Just let me know where I can reach you.
Have you customized your container names or hostnames in some way? With a vanilla 13.0.0 install, I'm unable to reproduce this. Where is awxtask coming from?
~/dev/awx/installer docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
55e39345ced2 ansible/awx:13.0.0 "tini -- /usr/bin/la…" 4 minutes ago Up 4 minutes 8052/tcp awx_task
8eea8e20de10 ansible/awx:13.0.0 "tini -- /bin/sh -c …" 4 minutes ago Up 4 minutes 0.0.0.0:80->8052/tcp awx_web
3faf2fe40ff6 postgres:10 "docker-entrypoint.s…" 4 minutes ago Up 4 minutes 5432/tcp awx_postgres
429c8448aa12 redis "docker-entrypoint.s…" 4 minutes ago Up 4 minutes 6379/tcp awx_redis
~/dev/awx/installer docker exec -it 55e39345ced2 hostname
awx
~/dev/awx/installer docker exec -it 8eea8e20de10 hostname
awxweb
~/dev/awx/installer docker exec -it 55e39345ced2 bash
bash-4.4# awx-manage dbshell
now exiting InteractiveConsole...
bash-4.4# awx-manage dbshell
psql (10.6, server 10.13 (Debian 10.13-1.pgdg90+1))
Type "help" for help.
awx=# SELECT hostname FROM main_instance;
hostname
----------
awx
(1 row)
Yes, but just through the install playbook or rather inventory. It contains the following changes to hostnames:
awx_task_hostname=awx
awx_web_hostname=ansible.example.com
As such the awx_web hostname is set to ansible. The task container has its default awx hostname.
I have not changed the container names in any way and have not touched anything outside the install inventory file (except after the errors for debugging and trying to solve the connectivity problems).
awx=# SELECT hostname FROM main_instance;
hostname
----------
awxtask
awx
(2 rows)
I have no idea where the awxtask is coming from...
Can you share this?
SELECT * FROM main_instance;
Sorry for the late response:
awx=# SELECT * FROM main_instance;
id | uuid | hostname | created | modified | capacity | version | last_isolated_check | capacity_adjustment | cpu | memory | cpu_capacity | mem_capacity | enabled | managed_by_policy | ip_address
----+--------------------------------------+----------+-------------------------------+-------------------------------+----------+---------+---------------------+---------------------+-----+------------+--------------+--------------+---------+-------------------+------------
1 | 00000000-0000-0000-0000-000000000000 | awxtask | 2020-01-13 10:21:13.472856+00 | 2020-01-13 10:21:13.472944+00 | 0 | | | 1.00 | 0 | 0 | 0 | 0 | t | t |
2 | 00000000-0000-0000-0000-000000000000 | awx | 2020-01-13 10:38:43.953606+00 | 2020-07-03 05:36:28.104388+00 | 17 | 13.0.0 | | 1.00 | 2 | 3973791744 | 8 | 17 | t | t |
(2 rows)
@zigaSRC,
I'm not really sure why you've got that second instance in there called awxtask, but I'd just delete that row entirely; the modified time shows that it hasn't actually had a successful heartbeat since Jan 13th, e.g.,
DELETE FROM main_instance WHERE hostname='awxtask'
Can't really do that since it would violate foreign key constraints in the database.
Using awx-manage deprovision_instance --hostname=awxtask did work though and the other instance is gone. There are no more errors in the logs which is great. Hopefully this fixes the issues we've been having after the upgrade. We'll just have to wait and see...
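For anyone scripting this cleanup, here is a small sketch that assembles the command for review before running it. It assumes the task container is named `awx_task`, as in the `docker ps` output earlier in the thread, and that `awx-manage` is run through `docker exec`; the helper name `deprovision_command` is hypothetical:

```python
import shlex


def deprovision_command(hostname, container="awx_task"):
    """Build the docker exec argv that deprovisions one AWX instance."""
    if not hostname or hostname.startswith("-"):
        raise ValueError("hostname must be a non-empty name, not an option")
    return [
        "docker", "exec", container,
        "awx-manage", "deprovision_instance", f"--hostname={hostname}",
    ]


# Print the command for review instead of executing it.
print(shlex.join(deprovision_command("awxtask")))
```

Printing rather than executing is deliberate: deprovisioning removes the instance record, so it is worth checking the hostname against `SELECT hostname FROM main_instance;` first.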
For now: Thanks for the help!
@zigaSRC yep, deprovision_instance works. I expect your errors to go away - you had an old instance lying around that didn't reflect reality. I'm unsure how it _got there_, but I expect it was just something we goofed up in a prior/older version of AWX that's now lost to the sands of time 😄.
Let me know if you spot any other issues.
The problems we were having are still there so that wasn't the root cause sadly. I created another issue since it apparently doesn't relate to the connectivity issue discussed here. I will just include the link here in case it has any relevance.
Hey @zigaSRC I get those pretty regularly, too, and I'm not sure what causes them (@mabashian, @jakemcdermott or @marshmalien might know?)
That said, I _think_ they're unrelated to whatever issues you're still having.
Hi, I have just updated our Ansible AWX instance from 9.1.1 to 13.0.0 and I am also getting the warning message that the task container cannot be reached by the web container.
awx.main.wsbroadcast Connection from awx to awx-task failed: 'Cannot connect to host awx-task:443 ssl:False [Connect call failed ('172.19.0.5', 443)]'.
I also renamed the containers according to the possibilities in the inventory of the installation tool.
Contrary to the warning message, however, running the Ansible Playbooks etc. works for me.
The strange thing for me is that the web container tries to contact the task container on port 443 with SSL off. Shouldn't it rather try the request on port 80? Maybe this is where the error lies.
In my opinion the ticket should be opened again, because the logs are filled with these messages.
Did you check that a container with that name exists? It's most likely just a remnant from before the update/s.
Check the date when it was last modified in the DB and delete it if it's before the update (follow the steps we went through to resolve it).
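The date check described above can be sketched as a small filter over the rows of `main_instance`. The rows are assumed to be already loaded as dicts with `datetime` values, and `upgraded_at` is whatever time the upgrade actually happened; the function name and the June date used in the usage note are illustrative assumptions:

```python
from datetime import datetime


def stale_hostnames(rows, upgraded_at):
    """Return hostnames whose row was last modified before the upgrade."""
    return [row["hostname"] for row in rows if row["modified"] < upgraded_at]


# Values taken from the main_instance output posted earlier in this thread.
rows = [
    {"hostname": "awxtask", "modified": datetime(2020, 1, 13, 10, 21, 13)},
    {"hostname": "awx", "modified": datetime(2020, 7, 3, 5, 36, 28)},
]
```

With a hypothetical upgrade time of `datetime(2020, 6, 19)`, `stale_hostnames(rows, datetime(2020, 6, 19))` flags only `awxtask`, the instance that had not been modified since January.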
Hi @zigaSRC, thanks for your response. I never changed the names of the docker containers during the last upgrades. The name of the web container is awx-web and the name of the task container is awx-task. I changed the names of the containers when updating from version 3.0 to 5.0 a long time ago.
I also checked via docker container ls if the containers are named accordingly and yes they are.
eef3327e0a04 ansible/awx:13.0.0 awx-task
9f624d48d474 ansible/awx:13.0.0 awx-web
26b9c667a45c postgres:10 awx-postgres
76bc29bf94c6 redis awx-redis
When I execute the command SELECT * FROM main_instance; in the database I get these two entries:
| id | uuid | hostname | created | modified | capacity | version | last_isolated_check | capacity_adjustment | cpu | memory | cpu_capacity | mem_capacity | enabled | managed_by_policy | ip_address |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 1 | 00000000-0000-0000-0000-000000000000 | awx | 2019-03-04 10:08:45.432112 | 2020-07-20 08:30:41.105110 | 18 | 13.0.0 | NULL | 1.00 | 2 | 4136701952 | 8 | 18 | true | true | NULL |
| 2 | 00000000-0000-0000-0000-000000000000 | awx-task | 2019-06-25 10:10:44.397751 | 2019-06-25 10:10:44.397804 | 0 | | NULL | 1.00 | 0 | 0 | 0 | 0 | true | true | NULL |
After I executed the command awx-manage deprovision_instance --hostname=awx-task, the database entry with ID = 2 was successfully deleted. But after restarting all the Docker containers, a new entry with the hostname awx-task was created again, which again had no values for CPU, memory etc. So the same error message appeared in the logs again, saying that the container with the name awx-task could not be found.
It is interesting that Ansible AWX executes the playbooks under the hostname awx, although a container with the name awx does not exist.
Thank you very much.
Now I have also executed the command awx-manage deprovision_instance --hostname=awx to delete the instance with the name awx. But then I get the following error message: No instance found with the current cluster host id.
There are already tickets #3959 and #7100 for this error message, and it seems that the name of the task container is hard-coded in some configuration files. It looks like renaming the containers is not a good idea, even though the installation routine allows it.
How should we proceed from here?
So, now I have renamed the Docker hostname (not the Docker service name) of the task container to awx and executed the command awx-manage deprovision_instance --hostname=awx-task again. Now the system seems to run stably and only one instance appears.
For me the problem is solved, but in my opinion the configuration option in the installation routine should be either removed or fixed.