Awx: Job Template Runtime performance is dramatically impacted by Notification on Start

Created on 20 Aug 2019 · 11 Comments · Source: ansible/awx

ISSUE TYPE

Bug Report

SUMMARY

Notify on Job Start notifications can dramatically reduce the performance of a JT run. In the scenario:

JT: multivault.yml
Notifications: 26 Slack Notifications being triggered on launch and on success
Runtime is 4:25
Note: the time between start messages is considerable (15 seconds), while the time between success messages is less than a second

JT: multivault.yml
Notifications: Disabled
Runtime is 0:03

JT: multivault.yml
Notifications: Notification on success
Runtime is 0:03

Note: this occurs with Webhook notifications as well

STEPS TO REPRODUCE

Create a generic JT
Run to completion
Create 26 Slack Notifications
Add the notifications to the JT from step 1, set to fire on launch
Run to completion

EXPECTED RESULTS

Expect JT runtimes to be similar and notifications to trigger async.
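
To illustrate the expectation above, here is a minimal sketch (not AWX code) of handing start notifications to a background worker pool so the job itself does not wait on delivery. The names `run_job` and `make_notifier` and the use of a thread pool are hypothetical stand-ins; the sketch does not model AWX's own background task system.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_job(notifiers, executor):
    """Hypothetical job runner: queue start notifications, then do the work."""
    # submit() returns immediately, so the job does not wait for delivery.
    futures = [executor.submit(notify, "running") for notify in notifiers]
    result = "successful"  # stand-in for the actual playbook run
    return result, futures


def make_notifier(gate, sent):
    """Build a fake notifier that blocks on `gate`, like a slow Slack call."""
    def notify(status):
        gate.wait()
        sent.append(status)
    return notify


gate = threading.Event()
sent = []
notifiers = [make_notifier(gate, sent) for _ in range(26)]

with ThreadPoolExecutor(max_workers=4) as pool:
    result, futures = run_job(notifiers, pool)
    # The job is done although no notification has been delivered yet.
    finished_before_delivery = result == "successful" and len(sent) == 0
    gate.set()  # release the slow notifiers
    for future in futures:
        future.result()
```

With this shape, the job's runtime is independent of how many notifications are attached or how slow each one is.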

ACTUAL RESULTS

Runtime performance dramatically impacted

ADDITIONAL INFORMATION

26 Notifications
[Screenshot: 2019-08-20, 3:19:12 PM]

No notifications
[Screenshot: 2019-08-20, 3:23:53 PM]

api high bug

Source

unlikelyzero

All 11 comments

Quick poking at the code: we create notifications when a job enters the running state, and they are created such that they run in a background task. Maybe creating the notifications is very expensive? This sure does smell like the notifications are not being processed in the background.

https://github.com/ansible/awx/blob/devel/awx/main/tasks.py#L1172

chrismeyersfsu on 20 Aug 2019

~It seems unlikely, but a part of me wondered if https://github.com/ansible/awx/issues/4533 might have negatively impacted performance as a side effect. (It's not clear to me that it would, but the change did happen recently and the change happened pretty close to where we register notifications to run asynchronously, so I'm a bit suspicious of the change).~

(UPDATE: Found smoking gun, described later in thread)

jladdjr on 21 Aug 2019

@unlikelyzero to clarify, is this a regression you've noticed in devel, or as part of @jladdjr's open PR: https://github.com/ansible/awx/pull/4291 ?

ryanpetrello on 21 Aug 2019

@ryanpetrello it does not appear on devel but is likely related to the custom messages changes

unlikelyzero on 21 Aug 2019

👍2

Bingo:
https://github.com/ansible/awx/blame/devel/awx/main/models/jobs.py#L676

Walked through the code and confirmed we're hitting this repeatedly for 'start' notifications.

I haven't walked through the code w/ success / failure notifications, but at a higher level did confirm that success messages fire with hardly a delay.

jladdjr on 22 Aug 2019

cc @matburt @ryanpetrello - saw you all chatting about ^ this earlier this morning. Based on what I read, it sounds like there may be some hidden side effects to pulling the sleep out. Can you provide some context on that?

jladdjr on 22 Aug 2019

okay, I think I see why this is only affecting start notifications:

run task calls self.instance.send_notification_templates("running")
.. which calls self.build_notification_message(nt, status)
.. which calls self.notification_data() twice
notification_data polls for self.job_host_summaries.all() and sleeps (up to 5 seconds by default) until something is returned
.. _but_, since all of this is being called _before_ the job has finished running, I wouldn't expect there to be any job summaries

So, doing the math:

self.notification_data() is called twice
.. and there's a total of 5 seconds of sleep each time
and @unlikelyzero used 26 notifications
that gives us 2 * 5 * 26 = 260 seconds which is 4 minutes and 20 seconds

That's only 5 seconds short of @unlikelyzero's total run time (4:25, or 265 seconds), which sounds about right given that his other jobs only took 3 seconds to run.

I think we have our smoking gun.

jladdjr on 22 Aug 2019

🎉1
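
The arithmetic in the comment above can be reproduced with a small model. This is a sketch under stated assumptions, not the AWX source: the `FakeClock` helper and the 1-second polling interval are hypothetical, and only the 5-second timeout, the two calls per notification, and the 26 notifications come from the thread.

```python
class FakeClock:
    """Deterministic clock so the example finishes instantly."""

    def __init__(self):
        self.now = 0.0

    def sleep(self, seconds):
        self.now += seconds


def notification_data(summaries, clock, timeout=5.0, interval=1.0):
    """Poll until host summaries exist or `timeout` seconds have passed."""
    waited = 0.0
    while not summaries and waited < timeout:
        clock.sleep(interval)  # assumed polling interval
        waited += interval
    return {"hosts": list(summaries)}


def send_start_notifications(count, clock):
    """Model 'start' notifications for a job with no host summaries yet."""
    for _ in range(count):
        # Per the thread, building one message calls notification_data twice.
        notification_data([], clock)
        notification_data([], clock)


clock = FakeClock()
send_start_notifications(26, clock)
total_seconds = clock.now
minutes, seconds = divmod(int(total_seconds), 60)
print(f"blocked for {minutes}:{seconds:02d}")
```

Because a job that has just started has no host summaries, every call in this model runs to the full timeout, which is what makes the cost scale linearly with the number of notifications.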

Added a small change that tells notification_data not to bother with blocking if the job is still running.

With that change:

job w/ 26 notifications set to fire on job start: 9.863 seconds
job w/out notifications: 7.0 seconds

jladdjr on 22 Aug 2019

🎉2
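
The change itself is not shown in the thread. A minimal sketch of the idea, assuming a status check guards the polling loop, might look like the following; the function signature, the status names treated as "still running", and the polling interval are all hypothetical.

```python
# Assumed set of statuses for which host summaries cannot exist yet.
UNFINISHED = ("pending", "waiting", "running")


def notification_data(status, summaries, sleep, timeout=5.0, interval=1.0):
    """Return notification data, waiting for summaries only on finished jobs."""
    block = status not in UNFINISHED  # skip the wait while the job runs
    waited = 0.0
    while block and not summaries and waited < timeout:
        sleep(interval)
        waited += interval
    return {"status": status, "hosts": list(summaries)}, waited


slept = []  # records each simulated sleep instead of really sleeping

_, running_wait = notification_data("running", [], slept.append)
_, finished_wait = notification_data("successful", [], slept.append)
_, ready_wait = notification_data("successful", ["host1"], slept.append)
```

In this sketch a running job returns immediately, a finished job with no summaries still waits up to the timeout, and a finished job whose summaries are already present does not wait at all.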

Oh man @jladdjr, excellent detective work on that one - that would have been a really noticeable regression and definitely caused problems for people using start notifications.

ryanpetrello on 22 Aug 2019

Danke! Huge props to @unlikelyzero for finding the nasty regression in the first place! Seriously, good find.