Awx: 504 timeouts when fetching job_events for a running job

Created on 31 Mar 2020 · 4 Comments · Source: ansible/awx

ISSUE TYPE
  • Bug Report
SUMMARY

We are getting 504 timeouts on the job_events API for a given host while a job is running, e.g. api/v2/hosts/1/job_events/. Possibly due to deadlocks?

EDIT: We redeployed awx_web to see if we could make the APIs work again, but we still get a 504 timeout on every api/v2/hosts/1/job_events/ request. Something is very wrong here.

ENVIRONMENT
  • AWX version: 9.3.0
  • AWX install method: openshift manual deployment
  • Ansible version: 2.9.5
  • Operating System: Official AWX images
  • Web Browser: Firefox, Postman, Python's requests.get() (from both a Linux and a Mac client)
STEPS TO REPRODUCE
  1. Start a task that takes a long time to finish (ours writes hundreds of commands to a device and takes about 8 minutes).
  2. Fetch the job events API while the task is running.
  3. The API returns 504 timeouts, and keeps doing so long after the job is done. (It did start responding again on an earlier occasion, but this time I've waited 10 minutes and still get no response from the API.)
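
For anyone trying to reproduce this, the steps above can be sketched as a small timing helper. This is only a rough sketch, not part of the original report: the base URL, the bearer token, and the helper names `job_events_url` and `timed_get` are assumptions made for illustration, and it uses only the Python standard library rather than the clients listed under ENVIRONMENT.

```python
import time
import urllib.error
import urllib.request


def job_events_url(base_url, host_id):
    """Build the per-host job_events URL quoted in the report."""
    return f"{base_url.rstrip('/')}/api/v2/hosts/{host_id}/job_events/"


def timed_get(url, token, timeout=90.0, opener=urllib.request.urlopen):
    """Return (status, elapsed_seconds).

    status is the HTTP status code, or None if no HTTP response arrived
    (client-side timeout, connection reset, DNS failure, ...).
    The opener argument exists only so the helper can be exercised offline.
    """
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )
    start = time.monotonic()
    try:
        with opener(request, timeout=timeout) as response:
            response.read()  # include the body transfer in the timing
            status = response.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # e.g. a 504 produced by the proxy in front of AWX
    except OSError:
        status = None  # URLError and socket timeouts are OSError subclasses
    return status, time.monotonic() - start
```

Calling `timed_get(job_events_url("https://awx.example.com", 1), token)` every few seconds while the long-running job executes (the hostname and token here are placeholders) should show whether the response time climbs past the proxy timeout once the job starts.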

It's important to note that when this occurs we get sporadic 504s on other endpoints too, but I believe that may be due to the workers being synchronous and there not being enough of them.

EXPECTED RESULTS

Being able to fetch job_events when a task is running.

ACTUAL RESULTS

Timeout on job_events.

ADDITIONAL INFORMATION

We found #6108 and disabled external logging to Logstash, but that doesn't seem to have solved the issue.

The timeout is happening due to OpenShift cutting the connection (499), but the request should never take more than 30 seconds anyway:

image

All 4 comments

We increased the container timeout from 30 seconds to 90 seconds. We can now see that the API takes anywhere from 45 to 65 seconds to respond. However, this morning (when I woke up to all the alarms) the API responded in around 0.8 seconds. Then the task started, and we got the long response times and timeouts again.
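
To make the arithmetic explicit, here is a minimal sketch of which response times a given proxy timeout would cut off. The helper name `cut_by_proxy` and the sample list are illustrative assumptions; the samples are just the end points of the ranges mentioned above, not measured data.

```python
def cut_by_proxy(elapsed_seconds, proxy_timeout_seconds):
    """Return the response times that exceed the proxy timeout."""
    return [t for t in elapsed_seconds if t > proxy_timeout_seconds]


# Illustrative values only: 0.8 s when idle, 45-65 s while the job runs.
samples = [0.8, 45.0, 65.0]

cut_at_30 = cut_by_proxy(samples, 30.0)  # old 30-second timeout
cut_at_90 = cut_by_proxy(samples, 90.0)  # new 90-second timeout
```

With the old 30-second limit every slow request is cut off before AWX answers, which is why it shows up as a 504; with 90 seconds the same requests complete, just slowly.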

Hey @JonasKs,

Would you mind giving 10.0.0 a try? I expect it might resolve this issue for you; others have reported similar issues here:

https://github.com/ansible/awx/issues/6391

Hi. We will deploy 10.0.0 and will report back ASAP. Cheers!

@JonasKs I'm gonna go ahead and close this as a duplicate of #6391, because I very highly suspect you're encountering the same issue reported there. If you see otherwise when you upgrade to 10.0.0, let me know and I'll take a peek.
