Notify on Job Start notifications can dramatically reduce the performance of a JT run. In the scenario:
JT: multivault.yml
Notifications: 26 Slack Notifications being triggered on launch and on success
Runtime is 4:25
Note: the time between success messages is considerable (15 seconds). The time between success messages is less than a second
JT: multivault.yml
Notifications: Disabled
Runtime is 0:03
JT: multivault.yml
Notifications: Notification on success
Runtime is 0:03
Note: this is occurs with Webhook notifications as well
Expect JT runtimes to be similar and notifications to trigger async.
Runtime performance dramatically impacted
26 Notifications

No notifications

Quick poking at the code. We create notifications at the beginning of when a job enters the running state. The created notifications are created such that they will run in a background task. Maybe creating the notifications is very expensive? This sure does smell of the notification not being processed in the background.
https://github.com/ansible/awx/blob/devel/awx/main/tasks.py#L1172
~It seems unlikely, but a part of me wondered if https://github.com/ansible/awx/issues/4533 might have negatively impacted performance as a side effect. (It's not clear to me that it would, but the change did happen recently and the change happened pretty close to where we register notifications to run asynchronously, so I'm a bit suspicious of the change).~
(UPDATE: Found smoking gun, described later in thread)
@unlikelyzero to clarify, is this a regression you've noticed in devel, or as part of @jladdjr's open PR: https://github.com/ansible/awx/pull/4291 ?
@ryanpetrello it does not appear on devel but is likely related to the custom messages changes
Bingo:
https://github.com/ansible/awx/blame/devel/awx/main/models/jobs.py#L676
Walked through the code and confirmed we're hitting this repeatedly for 'start' notifications.
I haven't walked through the code w/ success / failure notifications, but at a higher level did confirm that success messages fire with hardly a delay.
cc @matburt @ryanpetrello - saw you all chatting about ^ this this morning. Based on what I read, sounds like there may be some hidden side-effects with pulling the sleep out. Can you provide some context on that?
okay, I think I see why this is only affecting start notifications:
run task calls self.instance.send_notification_templates("running")self.build_notification_message(nt, status)self.notification_data() twice notification_data polls for self.job_host_summaries.all() and sleeps (up to 5 seconds by default) until something is returnedSo, doing the math:
self.notification_data() is called twiceThat's only 5 seconds longer than @unlikelyzero's total run time, which sounds about right given that his other jobs only took 3 seconds to run.
I think we have our smoking gun.
Added a small change that tells notification_data not to bother with blocking if the job is still running.
With that change:
Oh man @jladdjr, excellent detective work on that one - that would have been a really noticeable regression and definitely caused problems for people using start notifications.
Danke! Huge props to @unlikelyzero for finding the nasty regression in the first place! Seriously, good find.

Verified:
Without NT on Start or Success:
0:02
20+ Notification on Start and Success:
0:04
20+ nonfunctional NTs on Start and Success
0:04
Most helpful comment
Added a small change that tells
notification_datanot to bother with blocking if the job is still running.With that change: