AWX: 504 timeouts when fetching job_events for a running job

Created on 31 Mar 2020 · 4 Comments · Source: ansible/awx

ISSUE TYPE

Bug Report

SUMMARY

We are getting 504 timeouts on the job_events API for a given host while a job is running, e.g. api/v2/hosts/1/job_events/. Possibly due to deadlocks?

EDIT: We redeployed awx_web to see if that would make the APIs work again, but every api/v2/hosts/1/job_events/ request still ends in a 504 timeout. Something is very wrong here.

ENVIRONMENT

AWX version: 9.3.0
AWX install method: OpenShift manual deployment
Ansible version: 2.9.5
Operating System: Official AWX images
Web Browser: Firefox, Postman, Python's requests.get() (from both a Linux and a Mac client)

STEPS TO REPRODUCE

Start a task that takes a long time to finish (ours writes hundreds of commands to a device and takes about 8 minutes).
Fetch the job events API while the task is running.
The API returns 504 timeouts, even long after the job is done. (It did start responding again earlier, but now I've waited 10 minutes and still get no response from the API.)
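For anyone trying to reproduce this, here is a minimal sketch of a probe that hits the job_events endpoint a few times while the job runs and records the status code and elapsed time of each attempt. It uses only the standard library. AWX_URL, TOKEN, and the helper names fetch_status and probe are placeholders of mine, not anything from AWX itself, and the Bearer token header assumes you authenticate with an OAuth2 token.

```python
import time
import urllib.error
import urllib.request

# Hypothetical values: replace with your own AWX route and OAuth2 token.
AWX_URL = "https://awx.example.com"
TOKEN = "changeme"


def fetch_status(url, timeout):
    """GET url and return the HTTP status code (this performs a network call)."""
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {TOKEN}"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # e.g. a 504 returned by the router


def probe(url, attempts=3, timeout=90, fetch=fetch_status, clock=time.monotonic):
    """Call fetch repeatedly and return a list of (status, seconds) tuples.

    status is None when the request failed without an HTTP response
    (client-side timeout, connection reset, and so on).
    """
    results = []
    for _ in range(attempts):
        start = clock()
        try:
            status = fetch(url, timeout)
        except OSError:  # covers socket timeouts and urllib.error.URLError
            status = None
        results.append((status, clock() - start))
    return results
```

Calling probe(AWX_URL + "/api/v2/hosts/1/job_events/") while the long task is running should show whether the requests come back as 504s, time out on the client side, or just take very long.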

Note that when this occurs we also get sporadic 504s on other endpoints, but I believe that happens because the workers are synchronous and there are not enough of them.
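To illustrate why a few slow job_events requests could cause 504s elsewhere, here is a rough sketch of a first-come, first-served queue in front of a fixed pool of synchronous workers. This is only a toy model of the suspicion above, not a description of how AWX actually schedules requests; the worker count and timings are made up, and simulate is a name I picked for the example.

```python
import heapq


def simulate(workers, requests):
    """Return the total latency (wait + service) of each request.

    requests is a list of (arrival_time, service_time) pairs sorted by
    arrival time. Each worker handles one request at a time (sync model),
    and requests are served in arrival order.
    """
    free_at = [0.0] * workers  # time at which each worker becomes idle
    heapq.heapify(free_at)
    latencies = []
    for arrival, service in requests:
        worker_free = heapq.heappop(free_at)  # earliest available worker
        start = max(arrival, worker_free)     # wait if all workers are busy
        finish = start + service
        heapq.heappush(free_at, finish)
        latencies.append(finish - arrival)
    return latencies


# Two slow job_events calls (60 s each) arrive first, then a fast 0.5 s call.
slow_then_fast = [(0, 60), (0, 60), (1, 0.5)]
print(simulate(2, slow_then_fast))  # [60.0, 60.0, 59.5]
print(simulate(3, slow_then_fast))  # [60.0, 60.0, 0.5]
```

With two workers, the fast request waits behind the slow ones and ends up far above a 30 second timeout, even though it only needs half a second of work. With one more worker it returns immediately.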

EXPECTED RESULTS

Being able to fetch job_events when a task is running.

ACTUAL RESULTS

Timeout on job_events.

ADDITIONAL INFORMATION

We found #6108 and disabled external logging to Logstash, but that does not seem to have solved the issue.

The timeout happens because OpenShift cuts the connection (499), but the request should never take more than 30 seconds anyway.


JonasKs


Most helpful comment

@JonasKs I'm gonna go ahead and close this as a duplicate of #6391, because I strongly suspect you're encountering the same issue reported there. If you see otherwise when you upgrade to 10.0.0, let me know and I'll take a peek.

ryanpetrello on 31 Mar 2020


All 4 comments

We increased the container timeout from 30 seconds to 90 seconds. We can now see that responses from the API take anywhere from 45 to 65 seconds. However, this morning (when I woke up to all the alarms) the API responded in around 0.8 seconds. Then the task started, and we got the long response times and timeouts again.

JonasKs on 31 Mar 2020

Hey @JonasKs,

Would you mind giving 10.0.0 a try? I expect it might resolve this issue for you; others have reported similar issues here:

https://github.com/ansible/awx/issues/6391

ryanpetrello on 31 Mar 2020


Hi. We will deploy 10.0.0 and will report back ASAP. Cheers!

JonasKs on 31 Mar 2020

ryanpetrello on 31 Mar 2020

