We are getting 504 timeouts on the job_events API for a given host while a job is running, e.g. api/v2/hosts/1/job_events/. Possibly due to deadlocks?
EDIT: We redeployed awx_web to see if that would make the APIs work again, but every api/v2/hosts/1/job_events/ request still returns a 504 timeout. Something is very wrong here.
request.get() (from both a Linux and Mac client)
It's important to note that when this occurs we also get sporadic 504s on other endpoints, but I believe that may be because the workers are synchronous and there are not enough of them.
Being able to fetch job_events when a task is running.
Timeout on job_events.
We found #6108 and disabled external logging to Logstash, but that does not seem to have solved the issue.
The timeout is happening due to OpenShift cutting the connection (499), but the request should never take more than 30 seconds anyway:

We increased the container timeout from 30 seconds to 90 seconds. We can now see that the API takes anywhere from 45 to 65 seconds to respond. However, this morning (when I woke up to all the alarms) the API responded in around 0.8 seconds. Then the task started, and we got the slow responses and timeouts again.
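For anyone who wants to compare timings, here is a rough sketch of how the response time of the endpoint could be measured from a client. It uses only the Python standard library. The host name, the token placeholder and the `timed_get` helper are assumptions for illustration and are not taken from our setup; the 90 second timeout mirrors the container timeout mentioned above.

```python
import time
import urllib.error
import urllib.request


def timed_get(url, timeout=90.0, headers=None):
    """Issue one GET and return (status, elapsed_seconds).

    status is the HTTP status code, or None when the client itself
    gave up (timeout, connection refused, connection reset).
    """
    request = urllib.request.Request(url, headers=headers or {})
    start = time.monotonic()
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            response.read()  # include the body transfer in the timing
            status = response.status
    except urllib.error.HTTPError as exc:
        # 4xx/5xx answers (e.g. a 504 from the router) end up here.
        status = exc.code
    except OSError:
        # URLError and socket timeouts are both OSError subclasses.
        status = None
    return status, time.monotonic() - start


if __name__ == "__main__":
    # Hypothetical host and token; replace with real values.
    url = "https://awx.example.com/api/v2/hosts/1/job_events/"
    headers = {"Authorization": "Bearer <token>"}
    for attempt in range(3):
        status, elapsed = timed_get(url, headers=headers)
        print(f"attempt={attempt} status={status} elapsed={elapsed:.1f}s")
```

Running this once while a job is active and once while the system is idle should show whether the slow responses only appear during a running job.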
Hey @JonasKs,
Would you mind giving 10.0.0 a try? I expect it might resolve this issue for you; others have reported similar issues here:
Hi. We will deploy 10.0.0 and will report back ASAP. Cheers!
@JonasKs I'm gonna go ahead and close this as a duplicate of #6391, because I very highly suspect you're encountering the same issue reported there. If you see otherwise when you upgrade to 10.0.0, let me know and I'll take a peek.