AWX: Job details and Job view not working

Created on 9 May 2018 · 82 comments · Source: ansible/awx

ISSUE TYPE
  • Bug Report
COMPONENT NAME
  • UI
SUMMARY

Job details and Job view not working properly

ENVIRONMENT
  • AWX version: 1.0.6.5
  • AWX install method: docker on linux
  • Ansible version: 2.5.2
  • Operating System: RedHat 7.4
  • Web Browser: Firefox/Chrome
STEPS TO REPRODUCE

Run any playbook; failed and succeeded jobs are present but do not show any details.

EXPECTED RESULTS

Job details are displayed.

ACTUAL RESULTS

Nothing is shown: no errors, no timeouts, just nothing.

ADDITIONAL INFORMATION

For example, I have a failed job. When clicking on details, I can see the URL changing to:
https://awx-url/#/jobz/project/
However, nothing happens. When right-clicking and opening it in a new tab/page, I only get the navigation pane and a blank page.
The same happens when I click on the job itself.

Additionally, adding inventory sources works fine; however, when navigating to 'Schedule inventory sync' I can see the gear wheel spinning, but again nothing happens.
I did a fresh installation today (9 May).

Labels: api, needs_info, bug

All 82 comments

I am experiencing the same issue.

What are you using for a proxy in front of AWX? Do you have your awx_web container bound to 0.0.0.0:port or 127.0.0.1:port? I was experiencing the same issue while accessing AWX behind an nginx proxy running on the Linux host and noticed that when the proxy was disabled, the Job detail pages would display properly. After I set the awx_web container to listen on 127.0.0.1, I was no longer experiencing the issue. To set the awx_web container to 127.0.0.1, you can specify host_port=127.0.0.1:port (instead of host_port=port) in the installer inventory file, as shown below.
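For reference, a minimal sketch of the relevant installer inventory line, assuming port 80 is the exposed port (substitute your own; a later comment in this thread shows the same pattern with port 9999):

    # awx/installer/inventory
    host_port=127.0.0.1:80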

I'm having the same issue where the job details will not display (also running with a proxy in front of awx). Adjusting the awx_web container to listen on 127.0.0.1 did not resolve the issue. Prior to upgrading to 1.0.6.5 this was working properly.

ENVIRONMENT
AWX version: 1.0.6.5
AWX install method: docker on linux
Ansible version: 2.5.2
Operating System: Ubuntu 16.04
Web Browser: Firefox/Chrome

In developer tools I'm seeing this error:
WebSocket connection to 'wss://<>/websocket/' failed: WebSocket is closed before the connection is established.

where the <> is the correct uri to my instance.

"/#/jobs?job_search=page_size:20;order_by:-finished;not__launch_type:sync:1 /#/jobz/inventory/33:1". I am also usning Nginx as a front end proxy (port 443).

Thanks for the tip @anasypany and for trying this solution @cstuart1. I indeed also use nginx as a front-end proxy, as I need SSL and port 443. What I haven't tried yet is connecting directly to the awx_web container via an SSH tunnel. If the issue still persists then, it is in the application itself. I will not be able to test this today, but it will be the first thing I do tomorrow morning.

@cstuart1 Can you paste your nginx proxy config? (with censored environment details, of course)

@Borrelworst
The solution here is to add a block for the websocket in your nginx config:

location /websocket {
    proxy_pass http://x.x.x.x:80;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "Upgrade";
}

@anasypany this is probably what you were going to suggest/inquire about?

@cstuart1 I was able to get the job details pages working again with this simple nginx proxy config once awx_web was bound to 127.0.0.1:

location / {
    proxy_pass http://127.0.0.1:xxxx;  # xxxx = 80 in your case
    proxy_http_version 1.1;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
}

If you try this config, make sure to add HTTP_X_FORWARDED_FOR to your Remote Host Headers on AWX as well. Let me know if you have any luck!
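If you manage settings in files rather than the UI, a rough sketch of the equivalent "Remote Host Headers" entry is shown below. This assumes your deployment keeps overrides in /etc/tower/settings.py; adjust the list to match what you already have configured.

    # /etc/tower/settings.py -- sketch of the "Remote Host Headers" list
    REMOTE_HOST_HEADERS = ['HTTP_X_FORWARDED_FOR', 'REMOTE_ADDR', 'REMOTE_HOST']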

Yes, that resolved the issue for me.
I had already added HTTP_X_FORWARDED_FOR to AWX as I'm using SAML for auth.

For anyone else reading this thread and trying to set up SAML:
I also had to alter /etc/tower/settings.py (task and web) to have the following:
USE_X_FORWARDED_PORT = True
USE_X_FORWARDED_HOST = True

and restart tower after making the setting change.
This is mentioned in the Tower documentation, but I thought I would post it here in case someone else reads this thread.

@cstuart1: That indeed solved the issue. I have not set awx_web to bind explicitly to 127.0.0.1, and apparently that is not needed. The only issue I still see is that when I go to my custom inventory scripts and click on schedule inventory syncs, I just see the cog wheel, but nothing happens. This is also described in #1850.

I am also experiencing problems with job details. I deployed a stack with postgres, rabbitmq, memcache, awx_web and awx_task in a swarm (ansible role to check variables, create dirs, instantiating a docker-compose template, deploy and so on). I am using vfarcic docker-flow to provide access to all the services in the swarm and to automatically detect changes in the configuration and reflect those changes in the proxy configuration. Within this stack, only awx_web is provided access outside the swarm with the docker-flow stack.
All works well except that the websocket for the job listing and details works only during rare intervals, usually after repeatedly killing daphne and nginx inside the awx_web container.
Debugging in the browser, I can see a bunch of websocket upgrades being tried, all of them failing with "502 Bad Gateway" after 5-6 seconds. At the same time, for each of the failing websocket attempts, a message like the one below appears in the awx_web log:

2018/05/16 23:36:18 [error] 31#0: *543 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: <internal proxy ip>, server: _, request: "GET /websocket/ HTTP/1.1", upstream: "http://127.0.0.1:8051/websocket/", host: "<my specific virtual host>"

Occasionally, the following messages are also printed in the same log:

127.0.0.1:59526 - - [16/May/2018:19:22:54] "WSCONNECTING /websocket/" - -
127.0.0.1:59526 - - [16/May/2018:19:22:54] "WSCONNECT /websocket/" - -
127.0.0.1:59526 - - [16/May/2018:19:22:55] "WSDISCONNECT /websocket/" - -
127.0.0.1:59536 - - [16/May/2018:19:22:55] "WSCONNECTING /websocket/" - -
127.0.0.1:59536 - - [16/May/2018:19:22:55] "WSCONNECT /websocket/" - -
127.0.0.1:59536 - - [16/May/2018:19:22:56] "WSDISCONNECT /websocket/" - -
127.0.0.1:59976 - - [16/May/2018:19:23:06] "WSCONNECTING /websocket/" - -
127.0.0.1:59976 - - [16/May/2018:19:23:06] "WSCONNECT /websocket/" - -
127.0.0.1:59976 - - [16/May/2018:19:23:21] "WSDISCONNECT /websocket/" - -
127.0.0.1:60994 - - [16/May/2018:19:23:27] "WSCONNECTING /websocket/" - -
127.0.0.1:60994 - - [16/May/2018:19:23:27] "WSCONNECT /websocket/" - -
127.0.0.1:60994 - - [16/May/2018:19:25:05] "WSDISCONNECT /websocket/" - -
127.0.0.1:34510 - - [16/May/2018:22:42:34] "WSDISCONNECT /websocket/" - -
127.0.0.1:34710 - - [16/May/2018:22:42:43] "WSCONNECTING /websocket/" - -
127.0.0.1:34710 - - [16/May/2018:22:42:48] "WSDISCONNECT /websocket/" - -
127.0.0.1:34794 - - [16/May/2018:22:42:57] "WSCONNECTING /websocket/" - -
127.0.0.1:34794 - - [16/May/2018:22:43:02] "WSDISCONNECT /websocket/" - -
(...)
127.0.0.1:35964 - - [16/May/2018:23:35:48] "WSDISCONNECT /websocket/" - -
127.0.0.1:37394 - - [16/May/2018:23:35:52] "WSCONNECTING /websocket/" - -
127.0.0.1:37312 - - [16/May/2018:23:35:52] "WSDISCONNECT /websocket/" - -
127.0.0.1:37412 - - [16/May/2018:23:35:57] "WSCONNECTING /websocket/" - -
127.0.0.1:37394 - - [16/May/2018:23:35:57] "WSDISCONNECT /websocket/" - -

The haproxy config generated by docker-flow for this service (awx_web) is:

frontend services
(...)
    acl url_awx-stack_awxweb8052_0 path_beg /
    acl domain_awx-stack_awxweb8052_0 hdr_beg(host) -i <my specific virtual host>
    use_backend awx-stack_awxweb-be8052_0 if url_awx-stack_awxweb8052_0 domain_awx-stack_awxweb8052_0
(...)
backend awx-stack_awxweb-be8052_0
    mode http
    http-request add-header X-Forwarded-Proto https if { ssl_fc }
    http-request add-header X-Forwarded-For %[src]
    http-request add-header X-Client-IP %[src]
    http-request add-header Upgrade "websocket"
    http-request add-header Connection "upgrade"
    server awx-stack_awxweb awx-stack_awxweb:8052

It is very similar to a bunch of other services in the swarm.
As far as I can understand, the upstream referenced in the message above refers to daphne inside the awx_web container; that daphne instance listens on http://127.0.0.1:8051 and is proxied by the nginx configuration also running inside the same container. I am currently investigating how one can troubleshoot daphne (one probe approach is sketched below).
I would appreciate it if anyone could share ideas or guidelines for proceeding with the investigation.
Thanks!
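One way to narrow this down (a sketch, not something from this thread or the AWX docs): from inside the awx_web container, probe daphne directly with curl and a hand-rolled websocket handshake. A healthy endpoint should answer with 101 Switching Protocols, while the failure mode described above would hang for a few seconds and then error out. The Sec-WebSocket-Key value is an arbitrary example.

    # run inside the awx_web container
    curl -i -N \
      -H "Connection: Upgrade" \
      -H "Upgrade: websocket" \
      -H "Sec-WebSocket-Version: 13" \
      -H "Sec-WebSocket-Key: x3JJHMbDL1EzLkh9GBhXDw==" \
      http://127.0.0.1:8051/websocket/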

I'm experiencing the same issue

ENVIRONMENT
AWX version: 1.0.6.8
AWX install method: docker on linux
Ansible version: 2.5.2
Operating System: Debian 9
Web Browser: Firefox/Chrome

I have the same issue as well.

Hi, I had the same issue and I was able to get the job output by running this command to fix the permissions:

  • chmod 744 -R /opt/awx/embedded

Since most of these comments are related to proxy configurations, I should probably mention that I have the same issue but I do not have a proxy in front of mine.

I'm experiencing the same issue as well. Initially it works fine. I noticed that restarting the containers/docker resolves the issue. I will monitor it to determine if the issue occurs again, which I assume it will.

Same error.
I use nginx with a configuration similar to @anasypany's:

location / {
    proxy_pass http://127.0.0.1:8052;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
}

but I'm unable to see the job.

@cavamagie

ENVIRONMENT

  • AWX version: 1.0.6.15
  • AWX install method: docker on linux
  • Ansible version: 2.5.4
  • Operating System: Debian 9
  • Web Browser: Firefox/Chrome

cat awx/installer/inventory

host_port=127.0.0.1:9999

location / {
    proxy_pass http://127.0.0.1:9999/;
    proxy_http_version 1.1;
    proxy_set_header Host               $host;
    proxy_set_header X-Real-IP          $remote_addr;
    proxy_set_header X-Forwarded-For    $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto  $scheme;
    proxy_set_header Upgrade            $http_upgrade;
    proxy_set_header Connection         "upgrade";
}

It works for me

@cstuart1 Do you think we can chat out of band regarding SAML setup with AWX? I've been at this for hours with no success.

Edit: I commented on #1016 with details on how to configure AWX for use with SAML auth.

Same issues.
@SatiricFX I have noticed the same thing: restarting the docker containers usually helps.
Moreover, I am not using any proxy or HTTPS access.

@piroux That does temporarily resolve it for us as well. We haven't found a permanent fix for it. Maybe a bug.

It appears you can swap in a modified supervisor.conf to add verbose output to daphne:

[program:daphne]
command = /var/lib/awx/venv/awx/bin/daphne -b 127.0.0.1 -p 8051 awx.asgi:channel_layer -v 2

With this I am seeing the following behavior related to websockets from Daphne/nginx:

2018-06-27 03:18:59,295 DEBUG    Upgraded connection daphne.response.XbupPxYRcS!BfsxXxiUPF to WebSocket daphne.response.XbupPxYRcS!ReBXomhGtg
RESULT 2
OKREADY
10.255.0.2 - - [27/Jun/2018:03:19:02 +0000] "GET /websocket/ HTTP/1.1" 499 0 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17682"
2018-06-27 03:19:03,491 DEBUG    WebSocket closed for daphne.response.XbupPxYRcS!ReBXomhGtg
2018-06-27 03:19:21,372 DEBUG    Upgraded connection daphne.response.XbupPxYRcS!aPmLgJGDZd to WebSocket daphne.response.XbupPxYRcS!hTzJudfDoM
10.255.0.2 - - [27/Jun/2018:03:19:24 +0000] "GET /websocket/ HTTP/1.1" 499 0 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17682"
2018-06-27 03:19:25,571 DEBUG    WebSocket closed for daphne.response.XbupPxYRcS!hTzJudfDoM
2018-06-27 03:19:50,862 DEBUG    Upgraded connection daphne.response.XbupPxYRcS!lnvEJzPynj to WebSocket daphne.response.XbupPxYRcS!XCyaFNijYM
10.255.0.2 - - [27/Jun/2018:03:19:53 +0000] "GET /websocket/ HTTP/1.1" 499 0 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17682"
2018-06-27 03:19:53,999 DEBUG    WebSocket closed for daphne.response.XbupPxYRcS!XCyaFNijYM
RESULT 2
OKREADY

This eventually logs:

2018-06-27 03:34:03,939 WARNING  dropping connection to peer tcp4:127.0.0.1:34576 with abort=True: WebSocket opening handshake timeout (peer did not finish the opening handshake in time)
10.255.0.2 - - [27/Jun/2018:03:34:03 +0000] "GET /websocket/ HTTP/1.1" 502 575 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17682"
2018/06/27 03:34:03 [error] 32#0: *147 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 10.255.0.2, server: _, request: "GET /websocket/ HTTP/1.1", upstream: "http://127.0.0.1:8051/websocket/", host: "localhost:8080"
2018-06-27 03:34:03,941 DEBUG    WebSocket closed for daphne.response.XbupPxYRcS!gbrIRtuqeq

awx_web:1.0.6.23 here:

10.255.0.2 - - [28/Jun/2018:13:31:14 +0000] "GET /websocket/ HTTP/1.1" 502 575 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36"
2018/06/28 13:31:14 [error] 25#0: *440 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 10.255.0.2, server: _, request: "GET /websocket/ HTTP/1.1", upstream: "http://127.0.0.1:8051/websocket/", host: "awx.prmrgt.com:80"
10.255.0.2 - - [28/Jun/2018:13:31:19 +0000] "GET /websocket/ HTTP/1.1" 499 0 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36"

etc. The websocket is simply not working. The same reverse-proxy configuration was working before (with 1.0.3.29, for example). The nginx config is fine:

      location / {
        proxy_pass http://10.20.1.100:8053/;
        proxy_http_version 1.1;
        proxy_set_header   Host               $host:$server_port;
        proxy_set_header   X-Real-IP          $remote_addr;
        proxy_set_header   X-Forwarded-For    $proxy_add_x_forwarded_for;
        proxy_set_header   X-Forwarded-Proto  $scheme;
        proxy_set_header   Upgrade            $http_upgrade;
        proxy_set_header   Connection         "upgrade";
      }

I appended these lines to /etc/tower/settings.py:

USE_X_FORWARDED_PORT = True
USE_X_FORWARDED_HOST = True

I found ansible/awx_web:1.0.6.11 is the latest image that works fine for me (which means the websocket reverse-proxy settings outside awx_web are fine!). I hope this helps.

Please note the settings.py changes are not needed for 1.0.6.11 to work. I don't see any impact whether I set them or not.

I am also facing the same issue.

ENVIRONMENT

  • AWX version: 1.0.6.11
  • AWX install method: docker on linux
  • Ansible version: 2.5.7
  • Operating System: CentOS 7
  • Web Browser: Firefox/Chrome

The only workaround currently working for me is stopping everything and starting the containers again.

This issue does not appear to occur for a little while after redeploying AWX.

I did, however, notice that none of the job details from the period while this issue is occurring are available even after you restart. It appears as though the "stdout" response on the API is populated via the task container posting data to a websocket for that job.

I also noticed that when the issue is occurring, the task container fails with the following errors:

[2018-07-02 19:03:47,717: DEBUG/Worker-4] using channel_id: 2
2018-07-02 19:03:47,718 ERROR    awx.main.models.unified_jobs job 15 (running) failed to emit channel msg about status change
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/awx/main/models/unified_jobs.py", line 1169, in _websocket_emit_status
    emit_channel_notification('jobs-status_changed', status_data)
  File "/usr/lib/python2.7/site-packages/awx/main/consumers.py", line 70, in emit_channel_notification
    Group(group).send({"text": json.dumps(payload, cls=DjangoJSONEncoder)})
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/channels/channel.py", line 88, in send
    self.channel_layer.send_group(self.name, content)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/asgi_amqp/core.py", line 190, in send_group
    self.send(channel, message)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/asgi_amqp/core.py", line 95, in send
    self.recover()
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/asgi_amqp/core.py", line 77, in recover
    self.tdata.consumer.revive(self.tdata.connection.channel())
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/kombu/connection.py", line 255, in channel
    chan = self.transport.create_channel(self.connection)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 92, in create_channel
    return connection.channel()
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/amqp/connection.py", line 282, in channel
    return self.Channel(self, channel_id)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/amqp/channel.py", line 101, in __init__
    self._x_open()
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/amqp/channel.py", line 427, in _x_open
    self._send_method((20, 10), args)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/amqp/abstract_channel.py", line 56, in _send_method
    self.channel_id, method_sig, args, content,
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/amqp/method_framing.py", line 221, in write_method
    write_frame(1, channel, payload)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/amqp/transport.py", line 182, in write_frame
    frame_type, channel, size, payload, 0xce,
  File "/usr/lib64/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 104] Connection reset by peer

This would explain why the job details from jobs that ran while the websockets were not working aren't visible even after restarting the web/task container, and why they aren't available when hitting the stdout resource on the job endpoint.

I ran into this issue as well and resolved it by stopping both the web and task containers and rerunning the installer playbook to start them again.

We have the issue with 1.0.6.0, and it does not recover after deleting/recreating the pods for awx and etcd.

Restarting web/task on one dev host, where I was testing directly, fixed it.

In production I'm facing websocket errors behind custom reverse proxies. Is it possible to disable websockets completely via some header hack, or are they a hard requirement for AWX? Some libraries have fallback options.

Decided to take a look at the rabbitmq logs, and when websockets stop working I start seeing the following:

2018-07-07 00:56:02.000 [warning] <0.5148.0> closing AMQP connection <0.5148.0> (10.0.0.6:54140 -> 10.0.0.12:5672, vhost: 'awx', user: 'guest'):
client unexpectedly closed TCP connection
2018-07-07 00:56:02.001 [warning] <0.5138.0> closing AMQP connection <0.5138.0> (10.0.0.6:54138 -> 10.0.0.12:5672, vhost: 'awx', user: 'guest'):
client unexpectedly closed TCP connection
2018-07-07 00:56:02.001 [warning] <0.4690.0> closing AMQP connection <0.4690.0> (10.0.0.6:53950 -> 10.0.0.12:5672, vhost: 'awx', user: 'guest'):
client unexpectedly closed TCP connection
2018-07-07 00:56:02.055 [warning] <0.5182.0> closing AMQP connection <0.5182.0> (10.0.0.6:54150 -> 10.0.0.12:5672, vhost: 'awx', user: 'guest'):
client unexpectedly closed TCP connection
2018-07-07 00:56:02.056 [warning] <0.5172.0> closing AMQP connection <0.5172.0> (10.0.0.6:54148 -> 10.0.0.12:5672, vhost: 'awx', user: 'guest'):
client unexpectedly closed TCP connection
2018-07-07 00:56:02.057 [warning] <0.4731.0> closing AMQP connection <0.4731.0> (10.0.0.6:53974 -> 10.0.0.12:5672, vhost: 'awx', user: 'guest'):
client unexpectedly closed TCP connection
2018-07-07 00:56:02.058 [warning] <0.5192.0> closing AMQP connection <0.5192.0> (10.0.0.6:54198 -> 10.0.0.12:5672, vhost: 'awx', user: 'guest'):
client unexpectedly closed TCP connection

We're getting the following error every time we click on a job, both running jobs and ones that have already completed.

WebSocket connection to 'wss://{redacted}/websocket/' failed: WebSocket is closed before the connection is established.

We experienced this both on the latest AWX Web version and on several older revisions. ansible/awx_web:1.0.6.11 in particular was what we tried.

It's worth noting this container sits behind a reverse nginx proxy, but we've tried narrowing this down by removing the proxy altogether and are still getting the same errors/issue. We use this very heavily in production; are there any short-term fixes? Container reboots sometimes work for a few minutes, but typically fall back to the same errors.

Logs on AWX Web don't show anything overly useful, and likewise with postgres and task containers. RabbitMQ does show similar results as stated above.

2018-07-09 12:29:48.398 [warning] <0.11522.5> closing AMQP connection <0.11522.5> (10.0.5.240:40382 -> 10.0.5.234:5672, vhost: 'awx', user: 'guest'):
client unexpectedly closed TCP connection
2018-07-09 12:29:48.398 [warning] <0.17632.5> closing AMQP connection <0.17632.5> (10.0.5.240:46896 -> 10.0.5.234:5672, vhost: 'awx', user: 'guest'):
client unexpectedly closed TCP connection
2018-07-09 12:29:48.399 [warning] <0.23641.5> closing AMQP connection <0.23641.5> (10.0.5.240:53386 -> 10.0.5.234:5672, vhost: 'awx', user: 'guest'):
client unexpectedly closed TCP connection

Seeing this as well with AWX 1.0.6.25 and Ansible 2.6.1.

EDIT: 1.0.6.1 also seems to not work.

Any page requested like this never completely loads and is blank: https://awx/jobs/playbook/8

Playbooks do actually run (and sometimes fail), and notifications work fine.

Same behavior, but not seeing any of the errors others are reporting. Also, restarting the pod doesn't fix the issue for any amount of time. It looks like I'm just being sent back to the jobs list page.


10.32.5.17 - - [12/Jul/2018:15:50:50 +0000] "PROXY TCP4 10.32.44.94 10.32.44.94 41275 32132" 400 173 "-" "-"
[pid: 37|app: 0|req: 77/525] 10.244.8.0 () {48 vars in 3205 bytes} [Thu Jul 12 15:50:51 2018] GET /api/v2/inventory_updates/9/ => generated 4586 bytes in 104 msecs (HTTP/1.1 200) 8 headers in 248 bytes (1 switches on core 0)
10.244.8.0 - - [12/Jul/2018:15:50:51 +0000] "GET /api/v2/inventory_updates/9/ HTTP/1.1" 200 4586 "https://awx/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"
10.244.6.0 - - [12/Jul/2018:15:50:51 +0000] "OPTIONS /api/v2/inventory_updates/9/ HTTP/1.1" 200 11892 "https://awx/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"
[pid: 33|app: 0|req: 238/526] 10.244.6.0 () {50 vars in 3249 bytes} [Thu Jul 12 15:50:51 2018] OPTIONS /api/v2/inventory_updates/9/ => generated 11892 bytes in 149 msecs (HTTP/1.1 200) 8 headers in 249 bytes (1 switches on core 0)
10.244.10.0 - - [12/Jul/2018:15:50:51 +0000] "GET /api/v2/inventory_updates/9/events/?order_by=start_line&page=1&page_size=50 HTTP/1.1" 200 17126 "https://awx/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"
[pid: 36|app: 0|req: 123/527] 10.244.10.0 () {48 vars in 3299 bytes} [Thu Jul 12 15:50:51 2018] GET /api/v2/inventory_updates/9/events/?order_by=start_line&page=1&page_size=50 => generated 17126 bytes in 90 msecs (HTTP/1.1 200) 9 headers in 264 bytes (1 switches on core 0)

AWX 1.0.6.17 Ansible 2.5.5 running on Kubernetes

@Borrelworst
Hey friend, would you be able to paste your entire nginx.conf file? I am having the exact same issue but adding the stanza above did not fix my issue.

This is mine, FWIW:

    #user awx;

    worker_processes  1;

    pid        /tmp/nginx.pid;

    events {
        worker_connections  1024;
    }

    http {
        include       /etc/nginx/mime.types;
        default_type  application/octet-stream;

        log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                          '$status $body_bytes_sent "$http_referer" '
                          '"$http_user_agent" "$http_x_forwarded_for"';

        map $http_upgrade $connection_upgrade {
            default upgrade;
            ''      close;
        }

        sendfile        on;
        #tcp_nopush     on;
        #gzip  on;

        upstream uwsgi {
            server 127.0.0.1:8050;
            }

        upstream daphne {
            server 127.0.0.1:8051;
        }

        server {
            listen 8052 default_server;

            # If you have a domain name, this is where to add it
            server_name _;
            keepalive_timeout 65;

            # HSTS (ngx_http_headers_module is required) (15768000 seconds = 6 months)
            add_header Strict-Transport-Security max-age=15768000;

            location /nginx_status {
              stub_status on;
              access_log off;
              allow 127.0.0.1;
              deny all;
            }

            location /static/ {
                alias /var/lib/awx/public/static/;
            }

            location /favicon.ico { alias /var/lib/awx/public/static/favicon.ico; }

            location ~ ^/(websocket|network_ui/topology/) {
                # Pass request to the upstream alias
                proxy_pass http://daphne;
                # Require http version 1.1 to allow for upgrade requests
                proxy_http_version 1.1;
                # We want proxy_buffering off for proxying to websockets.
                proxy_buffering off;
                # http://en.wikipedia.org/wiki/X-Forwarded-For
                proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
                # enable this if you use HTTPS:
                proxy_set_header X-Forwarded-Proto https;
                # pass the Host: header from the client for the sake of redirects
                proxy_set_header Host $http_host;
                # We've set the Host header, so we don't need Nginx to muddle
                # about with redirects
                proxy_redirect off;
                # Depending on the request value, set the Upgrade and
                # connection headers
                proxy_set_header Upgrade $http_upgrade;
                proxy_set_header Connection $connection_upgrade;
            }

            location / {
                # Add trailing / if missing
                rewrite ^(.*)$http_host(.*[^/])$ $1$http_host$2/ permanent;
                uwsgi_read_timeout 120s;
                uwsgi_pass uwsgi;
                include /etc/nginx/uwsgi_params;
            }
        }
    }

PSA: If anyone here is using Docker Swarm and having these issues, try running the same stack with plain docker-compose (non-swarm v2) and see if you have the same problems.

The issues in this thread were all symptoms we were seeing while running in Swarm mode. Once we switched to local instances (docker-compose), we haven't had any issues running AWX behind an nginx proxy (specifically jwilder's, with custom SSL certificates).

Just wanted to toss this tidbit out there. The Red Hat/AWX team has specifically stated AWX is NOT supported on Swarm, but I know it makes sense for a lot of people to use Swarm.

@anthonyloukinas, I'm not in Swarm; I'm using docker-compose and it doesn't display job status properly at all.

@hitmenow Below is my server block. I left the original configuration intact and just created a conf file in conf.d:

   server {
       ssl on;

       listen       443 ssl default_server;
       server_name <servername>;
       ssl_certificate <certfile>;
       ssl_certificate_key <keyfile>;
       proxy_set_header X-Forwarded-For $remote_addr;
       include /etc/nginx/default.d/*.conf;

       location / {
           proxy_pass http://localhost:80/;
           proxy_http_version 1.1;
           proxy_set_header Host $host;
           proxy_set_header X-Real-IP $remote_addr;
           proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
           proxy_set_header X-Forwarded-Proto $scheme;
           proxy_set_header Upgrade $http_upgrade;
           proxy_set_header Connection "upgrade";
       }

       error_page 404 /404.html;
       location = /40x.html {
       }

       error_page 500 502 503 504 /50x.html;
       location = /50x.html {
       }
   }

It does work for me most of the time, but occasionally I have to restart docker to fix the issue again. The fact that so many people have the same issue tells me that either the documentation is not sufficient or there really is a bug in the software causing it.

@anthonyloukinas I'm not sure Red Hat provides any support for AWX, so it not being supported by Red Hat isn't a huge deal. We are just hoping for some help from the team to figure out what is causing this in the scenarios where it occurs (with and without Swarm) so we can contribute an open-source fix. Nobody seems to be providing any guidance or insight, which is understandable, but in my opinion we should keep collecting more information here.

What I've noticed is that once websockets stop working, subsequent attempts at the websocket opening handshake never complete. Running tcpdump on the web container on port 8051 shows web never sends out the accept-upgrade response.

I've traced the websocket connect request path and it's kind of messy. A websocket request gets handled by web but web defers responding to the handshake. Instead what happens is web creates a message on rabbitmq that a websocket connect was received. Task then picks up this message, puts a message back on rabbitmq with the contents {"accept": True}, and once web receives this message it sends out the handshake response to the client, successfully establishing a websocket connection.

What seems to be happening is that, at some point, there is a mismatch between the channels where web and task look for and place their messages (i.e. web listens for accept messages on channel A but task is sending those messages on channel B). Restarting the supervisor daemons on web and task at the same time (and other workarounds) seems to fix the issue, but only temporarily. I'm also not sure why web isn't handling the websocket handshake response itself.

Full disclosure: I've only been running into these problems when deploying AWX in a swarm environment where each container has no replicas. It looks like something about swarm is causing the channels used for communication between web and task to desynchronize.
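To make the relay described above concrete, here is a toy Python illustration (plain in-process queues standing in for the rabbitmq channels; this is not AWX code). Web defers the handshake until task echoes back an accept message, so if the two sides ever disagree about which channel to use, the get() simply times out and the handshake never completes, which matches the tcpdump observation.

    import queue
    import threading

    connect_events = queue.Queue()   # channel web publishes connect events to
    accept_replies = queue.Queue()   # channel web expects {"accept": True} replies on

    def task_worker():
        event = connect_events.get()                     # task picks up the connect event
        accept_replies.put({"accept": True, "client": event["client"]})

    def web_handle_handshake(client_id):
        connect_events.put({"type": "websocket.connect", "client": client_id})
        try:
            reply = accept_replies.get(timeout=5)        # wait for task's accept message
        except queue.Empty:
            return "handshake never completes (no reply / channel mismatch)"
        return "101 Switching Protocols" if reply.get("accept") else "403 Forbidden"

    threading.Thread(target=task_worker, daemon=True).start()
    print(web_handle_handshake("client-1"))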

Thank you @Borrelworst! I have a different scenario than you, I think: I have a load balancer with SSL termination in front of my containers, and my nginx server is listening on 8052. Will do some more troubleshooting. Thanks again.

I resolved it by setting the endpoint_mode of the RabbitMQ service to dnsrr in Docker Swarm mode.
The rabbitmq service in the compose file is:

  rabbitmq:
    image: rabbitmq:3
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
      endpoint_mode: dnsrr
    environment:
      RABBITMQ_DEFAULT_VHOST: "awx"
    networks:
      - webnet

Switching to dnsrr instead of the VIP kind of implies that it's an issue with the VIP timing out the idle connection:

https://github.com/moby/moby/issues/37466#issuecomment-405307656
https://success.docker.com/article/ipvs-connection-timeout-issue

This would match the described behavior, where it works initially and then, at some undefined later point (relatively quickly), stops working.

@sightseeker Is there an equivalent that you know of for Kubernetes deployments?

Thank you @strawgate!
When I set the TCP keepalive time to less than 900 seconds while using VIP mode, the problem no longer occurs.

@hitmenow I haven't tried yet with K8s.

It would also imply that switching the containers to using tasks.rabbitmq to hit rabbitmq would fix the issue as that bypasses the VIP too. Will test and report back

@hitmenow Kubernetes doesn't use VIPs or Swarm networking, so dnsrr is probably not related to your issue.

I'm running AWX in pure docker containers on the same machine (no swarm or k8s) and I was hitting this issue too.

Setting net.ipv4.tcp_keepalive_time=600 helped me as well, but it needs to be set before daphne runs, so it should be put into /etc/sysctl.conf on the host system or similar.
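For anyone who wants to apply this, a minimal sketch of the host-level change (values taken from this thread; run on the Docker host, not inside the containers):

    # on the Docker host
    echo 'net.ipv4.tcp_keepalive_time = 600' >> /etc/sysctl.conf
    sysctl -p                      # reload sysctl.conf without a reboot
    # then restart the AWX containers so daphne starts with the new value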

I just updated the TCP keepalive setting in my staging and production environments. I will check whether this solution helps with the issue.

I have the same issue as well.

ENVIRONMENT

AWX version: 1.0.7
AWX install method: docker on linux
Ansible version: 2.5.4
Operating System: CentOS 7
Web Browser: Firefox/Chrome

I have this issue as well. I was on 1.0.4.50 and that was working fine. I've moved up to 1.0.7.0 and now I just see a spinning 'working' wheel when I try to see job history. I've tried different browsers and incognito windows, but no change.

I'm running AWX just on normal docker. Not on k8s or openshift.

I was using haproxy in front for SSL offload but I still see the same if I browse to the awx_web container on its exposed web port (8052)

grahamneville - do you have any container logs we can take a look at?

@jakemcdermott

I've tried a few things, listed below, that people have suggested fixed the issue and some more but I've had no luck.

  • Hitting AWX_WEB directly and not using any proxy in front
  • Multiple Browsers, clearing cache and incognito windows
  • Deleting all containers and removing the postgres database storage and doing a fresh install
  • Setting host_port=127.0.0.1:port in the inventory file for exposing the port in awx_web
  • Changed /etc/tower/settings.py to have USE_X_FORWARDED_PORT = True and USE_X_FORWARDED_HOST = True, which I baked into a new build
  • Set net.ipv4.tcp_keepalive_time=600, restarted the docker service on the host, and restarted all containers
  • chmod 744 -R /opt/awx/embedded - /opt/awx/embedded doesn't exist on the containers
  • Reverted commit 2d4fbffb919884a8f9fb6ba690756cefd61929c7

These are the logs I see from the awx_web container; I'm not seeing anything coming through at the same time on any of the other containers.

[pid: 138|app: 0|req: 29/440] 1.1.1.1 () {50 vars in 2485 bytes} [Fri Aug 17 08:16:22 2018] OPTIONS /api/v2/jobs/744/ => generated 12949 bytes in 216 msecs (HTTP/1.1 200) 10 headers in 387 bytes (1 switches on core 0)
1.1.1.1 - - [17/Aug/2018:08:16:22 +0000] "OPTIONS /api/v2/jobs/744/ HTTP/1.1" 200 12949 "https://ourawxhost/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36" "2.2.2.2"
[pid: 136|app: 0|req: 258/441] 1.1.1.1 () {48 vars in 2447 bytes} [Fri Aug 17 08:16:22 2018] GET /api/v2/jobs/744/ => generated 9971 bytes in 237 msecs (HTTP/1.1 200) 10 headers in 386 bytes (1 switches on core 0)
1.1.1.1 - - [17/Aug/2018:08:16:22 +0000] "GET /api/v2/jobs/744/ HTTP/1.1" 200 9971 "https://ourawxhost/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36" "2.2.2.2"
1.1.1.1 - - [17/Aug/2018:08:16:22 +0000] "GET /api/v2/jobs/744/job_events/?order_by=-counter&page=1&page_size=50 HTTP/1.1" 200 62930 "https://ourawxhost/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36" "2.2.2.2"
[pid: 135|app: 0|req: 29/442] 1.1.1.1 () {48 vars in 2544 bytes} [Fri Aug 17 08:16:22 2018] GET /api/v2/jobs/744/job_events/?order_by=-counter&page=1&page_size=50 => generated 62930 bytes in 415 msecs (HTTP/1.1 200) 11 headers in 402 bytes (1 switches on core 0)
1.1.1.1 - - [17/Aug/2018:08:16:22 +0000] "HEAD / HTTP/1.1" 200 0 "-" "-" "-"
[pid: 136|app: 0|req: 259/443] 1.1.1.1 () {28 vars in 291 bytes} [Fri Aug 17 08:16:22 2018] HEAD / => generated 11339 bytes in 24 msecs (HTTP/1.1 200) 5 headers in 161 bytes (1 switches on core 0)
1.1.1.1 - - [17/Aug/2018:08:16:24 +0000] "HEAD / HTTP/1.1" 200 0 "-" "-" "-"
[pid: 136|app: 0|req: 260/444] 1.1.1.1 () {28 vars in 291 bytes} [Fri Aug 17 08:16:24 2018] HEAD / => generated 11339 bytes in 24 msecs (HTTP/1.1 200) 5 headers in 161 bytes (1 switches on core 0)
1.1.1.1 - - [17/Aug/2018:08:16:26 +0000] "HEAD / HTTP/1.1" 200 0 "-" "-" "-"
[pid: 136|app: 0|req: 261/445] 1.1.1.1 () {28 vars in 291 bytes} [Fri Aug 17 08:16:26 2018] HEAD / => generated 11339 bytes in 24 msecs (HTTP/1.1 200) 5 headers in 161 bytes (1 switches on core 0)
1.1.1.1 - - [17/Aug/2018:08:16:28 +0000] "HEAD / HTTP/1.1" 200 0 "-" "-" "-"
[pid: 136|app: 0|req: 262/446] 1.1.1.1 () {28 vars in 291 bytes} [Fri Aug 17 08:16:28 2018] HEAD / => generated 11339 bytes in 24 msecs (HTTP/1.1 200) 5 headers in 161 bytes (1 switches on core 0)
1.1.1.1 - - [17/Aug/2018:08:16:30 +0000] "HEAD / HTTP/1.1" 200 0 "-" "-" "-"
[pid: 137|app: 0|req: 84/447] 1.1.1.1 () {28 vars in 291 bytes} [Fri Aug 17 08:16:30 2018] HEAD / => generated 11339 bytes in 24 msecs (HTTP/1.1 200) 5 headers in 161 bytes (1 switches on core 0)
1.1.1.1 - - [17/Aug/2018:08:16:32 +0000] "HEAD / HTTP/1.1" 200 0 "-" "-" "-"
[pid: 136|app: 0|req: 263/448] 1.1.1.1 () {28 vars in 291 bytes} [Fri Aug 17 08:16:32 2018] HEAD / => generated 11339 bytes in 24 msecs (HTTP/1.1 200) 5 headers in 161 bytes (1 switches on core 0)
1.1.1.1 - - [17/Aug/2018:08:16:34 +0000] "HEAD / HTTP/1.1" 200 0 "-" "-" "-"
[pid: 136|app: 0|req: 264/449] 1.1.1.1 () {28 vars in 291 bytes} [Fri Aug 17 08:16:34 2018] HEAD / => generated 11339 bytes in 24 msecs (HTTP/1.1 200) 5 headers in 161 bytes (1 switches on core 0)

It's just the job details/history view that's a problem, plus the fact that you don't get to see the job running in real time when you launch a new job; every other page loads fine.
This is one of the URLs that I'm trying to get to, as seen when clicking on the job in the jobs view:
https://ourawxhost/#/jobs/playbook/750?job_search=page_size%3A20%3Border_by%3A-finished%3Bnot__launch_type%3Async

Any suggestions on what can be done to troubleshoot this further please?

Also having this problem in k8s. Tried a few things listed here, but still will randomly get closed sockets even when directly connected to the web container. If there are any debugging things to run, I can do so if needed.

I'm unclear on what might be causing the closed sockets SamKirsch mentioned, but that sounds like a deeper, different issue and one not entirely constrained to the job details page?

There are some race conditions involving setting up the initial connection to the job details page that have been resolved downstream and will be landing in AWX shortly.

These changes _might_ resolve some of the issues mentioned by others above - one way to know if they will help is if you're currently still able to see dynamic updates to socket-driven content other than the incoming output lines (status icons, elapsed times, project updates, etc.).

If _nothing_ is updating dynamically anywhere on the app during job runs then this points to a potentially deeper configuration issue. If this is the case for you it might be worth opening a separate github issue (or visiting our IRC channel) to help in tracking your specific problem down, as there are many different potential underlying causes for socket connectivity issues.

The closed sockets I am talking about are all in this thread: closed websockets. I notice closed websockets after an unspecified time (it's not always the same) when I try to view job details, and also jobs that are running / have run. This does not mean it never shows; sometimes a full container restart lets everything show again. I hope the upcoming upstream changes will help :)

So I've found the reason for my issues and why I couldn't see the job details. It was down to the Chrome version I had installed.

61.0.3163.79 caused issues where the 'working' wheel was just spinning.
Upgrading to 67.0.3396.99 fixed these issues and I can now see the job details.

@grahamn-gr
Thanks for your answer; I updated my Chrome to the newest version and the problem is solved!

It sounds like a number of people are having better luck with a newer version of Chrome, though from the variety of comments, it feels like this ticket has become a catch-all for any sort of odd bug related to the job details page.

I'm going to go ahead and close this; if anybody continues to encounter issues in 1.0.7, please let us know by filing a new issue with details.

@ryanpetrello JFYI, still facing this issue on version 1.0.7.2.

@boris-42 can you provide the environment details from https://github.com/ansible/awx/issues/new?template=bug_report.md, including web browser version?

@ryanpetrello

  • We are using the official image 1.0.7.2
  • The web browser is not the problem (we tried different browsers on different OSes)

Some observations:

  • If we curl "api/v2/jobs//stdout/", it's empty
  • After a restart of awx web and awx task, it gets populated
  • In the logs of awx task we see " File "/usr/lib/python2.7/site-packages/awx/main/models/unified_jobs.py", line 1169, in _websocket_emit_status", the same as in one of the comments above
  • After restarting, it works for ~15 minutes
  • Seems like a problem between awx-task and rabbitmq...

It sounds to me like job events aren't being saved into the database. This can be caused by a number of things. Do you see anything when you visit /api/v2/jobs/N/event/?

@ryanpetrello I suspect you meant jobs_events.

It returns:

{
  "count": 0, 
  "next": null, 
  "previous": null, 
  "results": []
}

If I restart awx-task and awx-web, this information gets populated, and it continues working until we see that rabbitmq-related log message in awx-task.

Yep, that's exactly what I meant, thanks :)

In your awx task container, can you run:

supervisorctl -c /supervisor_task.conf status

@ryanpetrello

bash-4.2$ supervisorctl -c /supervisor_task.conf status
awx-config-watcher                  RUNNING   pid 195, uptime 12:38:18
tower-processes:callback-receiver   RUNNING   pid 199, uptime 12:38:18
tower-processes:celery              RUNNING   pid 196, uptime 12:38:18
tower-processes:celery-watcher      RUNNING   pid 198, uptime 12:38:18
tower-processes:channels-worker     RUNNING   pid 197, uptime 12:38:18

@ryanpetrello

Some more information:

  • If I create a schedule that runs jobs every 3-5 minutes, it works perfectly
  • If I create a schedule that runs jobs with a gap of 20 minutes, it stops working

@ryanpetrello Some more details. The bug is reproduced on many versions of AWX.

If I run /usr/bin/awx-manage run_callback_receiver in the task container,

all results get sent to the database...

The more interesting thing is this piece of code:
https://github.com/ansible/awx/blob/devel/awx/main/management/commands/run_callback_receiver.py#L233-L238

If something happens to rabbitmq and we get a broken connection, it's not recreated; on the other hand, we have a large try/except around the code that uses the connection, which doesn't let run_callback_receiver crash so that supervisor would bring it back...

@boris-42 the example you linked is catching KeyboardInterrupt - I'd expect the callback receiver to gracefully handle and recover from AMQP unavailability in the way you described (testing this a bit myself).

I'm having a hard time reproducing this by stopping RabbitMQ - the callback receiver recovers for me after stopping and starting the message broker:

(screenshot omitted)

It also seems resilient to me screwing with TCP via tcpkill:

(screenshot omitted)

@boris-42 do you see any logs in the task container for the callback receiver that might provide some hints?

IMHO, I don't know why this issue is closed when it is still happening, even with recent versions.

@josemgom the reason it's closed is that the original reporter described their issue and found a solution to it here: https://github.com/ansible/awx/issues/1861#issuecomment-388286258

(also, see: https://github.com/ansible/awx/issues/1861#issuecomment-415033350)

The number of people chiming in on this one has generated a lot of noise; it's likely people are encountering a _number_ of issues across a variety of configurations that are being conflated:

  • some people are using older awx versions with resolved bugs
  • some are deploying behind a proxy and needed additional X-Forwarded-For configuration
  • some have reported that things work better with a newer version of Chrome

If you're still encountering an issue with the job details page, and you're using the most recent version of awx, _and_ none of the suggestions in this comment thread have addressed it for you, then please open a new issue with as much detail as possible about the problem you're encountering: https://github.com/ansible/awx/issues/new?template=bug_report.md

In the meantime, I and other awx maintainers are happy to help as much as possible here (see my and others' various interactions with people above) and in our IRC room on freenode (#awx-devel).

@ryanpetrello you are back ! =)

Steps to reproduce:

  • My production deployment is running on top of k8s and looks like this:
    -- awx-rabbitmq is a statefulset with 3 replicas
    -- memcached and postgres are 2 deployments
    -- awx-web is coupled with awx-task in the same pod as part of one deployment (there is some bug that we are still debugging that is blocking us from decoupling them)
  • After deploying everything, don't touch anything for 15+ minutes
  • Run any job template (the demo one, for example)
  • You won't see the logs in the output
  • If you restart the callback receiver, the logs are populated
  • (If you don't run anything for the next 15 minutes, the issue is reproduced again)

Hey @boris-42,

Do you see any logs in the task container for the callback receiver that might provide some hints? Errors/exceptions/tracebacks?

@boris-42 @strawgate @DBLaci @nmpacheco and others who have encountered the Connection reset by peer errors: we _think_ we might have an idea of what's causing this issue. If any of you are feeling like experimenting, could you give this PR a try in your environments to see if it improves things?

https://github.com/ansible/awx/pull/2391

Alternatively, you could try running something like this (in all of your containers) and _then_ restarting awx services to get the latest version:

~ /var/lib/awx/venv/awx/bin/pip uninstall asgi-amqp
~ /var/lib/awx/venv/awx/bin/pip install "asgi-amqp==1.1.2"

@ryanpetrello Thanks, I'll try to patch the container this weekend!

Thanks @ryanpetrello

I just upgraded the package in my development and production environments. I'll let you know if the users are still facing this issue.

Running:
/var/lib/awx/venv/awx/bin/pip install -U asgi-amqp==1.1.2
brought in a newer version of kombu (4.2.1), which starts breaking daphne/celery badly.

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/bin/daphne", line 11, in <module>
    sys.exit(CommandLineInterface.entrypoint())
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/daphne/cli.py", line 144, in entrypoint
    cls().run(sys.argv[1:])
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/daphne/cli.py", line 174, in run
    channel_layer = importlib.import_module(module_path)
  File "/usr/lib64/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
  File "/usr/lib/python2.7/site-packages/awx/asgi.py", line 9, in <module>
    prepare_env() # NOQA
  File "/usr/lib/python2.7/site-packages/awx/__init__.py", line 55, in prepare_env
    if not settings.DEBUG: # pragma: no cover
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/conf/__init__.py", line 56, in __getattr__
    self._setup(name)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/conf/__init__.py", line 41, in _setup
    self._wrapped = Settings(settings_module)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/conf/__init__.py", line 110, in __init__
    mod = importlib.import_module(self.SETTINGS_MODULE)
  File "/usr/lib64/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
  File "/usr/lib/python2.7/site-packages/awx/settings/production.py", line 17, in <module>
    from defaults import *  # NOQA
  File "/usr/lib/python2.7/site-packages/awx/settings/defaults.py", line 7, in <module>
    import djcelery
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/djcelery/__init__.py", line 34, in <module>
    from celery import current_app as celery  # noqa
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/celery/five.py", line 312, in __getattr__
    module = __import__(self._object_origins[name], None, None, [name])
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/celery/_state.py", line 20, in <module>
    from celery.utils.threads import LocalStack
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/celery/utils/__init__.py", line 405, in <module>
    from .functional import chunks, noop                    # noqa
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/celery/utils/functional.py", line 19, in <module>
    from kombu.utils.compat import OrderedDict
ImportError: cannot import name OrderedDict

Running:
/var/lib/awx/venv/awx/bin/pip install -U asgi-amqp==1.1.2 kombu==3.0.37
and holding back kombu appears to have worked. No more Connection reset by peer errors, and the job details load!

ENVIRONMENT

AWX version: 2.0.0
AWX install method: docker on linux
Ansible version: 2.6.5
Operating System: Ubuntu 18.04
Web Browser: Firefox/Chrome

@taspotts thanks for the feedback. We've merged the asgi_amqp update and are planning to release it in a new version of awx in the near future.

@boris-42 @strawgate @DBLaci @nmpacheco and others who have encountered the Connection reset by peer errors: we've released a new version of awx, 2.0.1, which we believe should resolve this issue. Please give it a shot and let us know if you continue to encounter issues!

I also had this error and verified that it was fixed in the latest released Docker image.
Thanks for addressing this issue!

Closing this, please reopen if it persists.

@ryanpetrello thanks for fixing this, I checked it finally yesterday, everything works.
