@sivel has a perf callback plugin: https://github.com/ansible/ansible/pull/46346
We should integrate/merge this into ours to collect perf stats.
This is for collection only. Visualization would be other issues/initiatives.
Tasks:
- Runner.event_callback

@matburt you mentioned that we would probably subclass the existing AWX callback plugin from this new perf plugin. Alternatively, I think it makes more sense not to inherit one from the other, but just to have both enabled. This should involve no changes to the existing callback plugin. One more reason for this is that they are different plugin types: the perf callback module has CALLBACK_TYPE = 'aggregate', whereas the AWX (and now ansible-runner) callback module is the 'stdout' type.
We specify the standard callback module via ANSIBLE_STDOUT_CALLBACK. I think the perf callback plugin would be enabled via ANSIBLE_CALLBACK_WHITELIST, which is a list.
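To make that concrete, a minimal sketch of the environment ansible-runner could pass to ansible-playbook (ANSIBLE_STDOUT_CALLBACK and ANSIBLE_CALLBACK_WHITELIST are real Ansible settings; the exact wiring in runner is still to be decided):

```python
# Only one stdout callback can be active, but whitelisted callbacks are
# additive, so the 'aggregate'-type perf plugin can run alongside the
# 'stdout'-type display plugin.
env = {
    'ANSIBLE_STDOUT_CALLBACK': 'awx_display',
    'ANSIBLE_CALLBACK_WHITELIST': 'cgroup_perf_recap',
}
```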
Incidentally, this is the same thing I had in mind when I filed https://github.com/ansible/awx/issues/41, it would have similarly been a stats gathering thing.
The problem I was worried about is that we need to be able to get the data in our plugin in order to send it along with the rest of the event data.
@wenottingham @jladdjr and I met today.
One question that came up is:
As a user, I would mostly want to see this if I was actively developing or debugging a playbook. It would create a lot of noise that would not be desirable to see on every run.
@kialam looks like we will need to surface this option in the UI somehow
Initial idea was a global (system-wide) setting.
I could see it at a per-JT level, but it's not required.
Talking with @chrismeyersfsu :
1) Setting should be system wide
2) Stats should not show with each task, and this would be extra work to do. "It should just do the right thing"
3) We might need to do UI work to make summary at bottom "pretty"
4) Not concerned about growth in job event size for DB
The recap output by the cgroup_perf_recap plugin can be disabled. I'd imagine if the data is being collected with write_files, and displayed in some nice visual way, that the recap wouldn't be necessary.
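For reference, a hedged sketch of how those knobs can be set via the environment (variable names are taken from the plugin's documentation at the time of writing; treat them as assumptions if the plugin has changed since):

```python
env = {
    'CGROUP_CONTROL_GROUP': 'ansible_profile',  # cgroup the play runs under
    'CGROUP_WRITE_FILES': 'True',               # persist per-task data files
    'CGROUP_OUTPUT_DIR': '/tmp/ansible-perf',   # where those files land
    'CGROUP_DISPLAY_RECAP': 'False',            # suppress the recap output
}
```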
Setting should be system wide
👍
Stats should not show with each task, and this would be extra work to do. "It should just do the right thing"
@chrismeyersfsu - @matburt and I synced up about this earlier and he had asked for per-task profiling information in the job events. Can you say more about your thoughts on having the stats for the entire playbook run?
"It should just do the right thing"
As written, the perf callback plugin writes the profiling information to a set of files, but doesn't put any information in job events. Here's how I'm currently handling that:
https://github.com/jladdjr/ansible-runner/blob/perf_stats/ansible_runner/callbacks/cgroup_perf_recap.py#L48-L55
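The rough shape of that handling, as a hedged sketch (the file layout and helper name are illustrative, not necessarily what the branch does):

```python
import glob
import os

def collect_profiling_data(output_dir, task_uuid):
    """Gather the per-task files the perf plugin wrote, keyed by feature."""
    profiling_data = {}
    for path in glob.glob(os.path.join(output_dir, '{0}-*'.format(task_uuid))):
        # e.g. '<uuid>-cpu' -> 'cpu', '<uuid>-memory' -> 'memory'
        feature = os.path.basename(path).rsplit('-', 1)[-1]
        with open(path) as f:
            profiling_data[feature] = f.read()
    return profiling_data
```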
We might need to do UI work to make summary at bottom "pretty"
We ran into this same kind of question with the new playbook host summary fields and ultimately decided to just have parity with what users would see when they ran using ansible itself.
https://github.com/ansible/awx/pull/3272#issuecomment-466619365
Update:
(tl;dr): the way I adapted the perf callback plugin (having it support runner's BaseCallbackModule) is leading to duplicated job events. AWX's serializers only expect one copy of the job events (e.g. here), so that won't work. I think that instead I should go with what was initially discussed on the ticket and have BaseCallbackModule subclass the perf plugin. The only issue is that the perf plugin's callback methods don't make a call to super. As discussed below, though, this may not be an issue. Going to tinker a bit and see if this works.
@matburt - The way I have it written now, the perf callback plugin is extending runner's BaseCallbackModule. (See https://github.com/jladdjr/ansible-runner/blob/perf_stats/ansible_runner/callbacks/cgroup_perf_recap.py#L38). But what I'm finding is that whenever a BaseCallbackModule object's callback methods get called, a new job event is generated. (For some reason I thought that the job events created by the awx_display callback would be merged with the ones created by the new callback I wrote, but instead it creates an entirely new set of job events).
At this point I've come full circle and think it makes sense to do what it sounds like you and @alancoding originally suggested: ".. you mentioned that we would probably subclass the existing AWX callback plugin from this new perf plugin". That would prevent any extra job events from being created.
@alancoding brought up some good points about how the perf plugin is a different callback type than runner's BaseCallbackModule. I suspect that both plugins would still work fine if one subclassed the other, though; BaseCallbackModule would just overwrite CALLBACK_TYPE - setting it to stdout - and that's the only value Ansible would end up seeing when it initialized the callback.
The only problem I currently see with the idea of sub-classing the perf plugin is that the perf plugin's callback methods don't call the parent's methods (i.e. no calls to super):
https://github.com/ansible/ansible/blob/devel/lib/ansible/plugins/callback/cgroup_perf_recap.py#L394-L398
If you look at the Ansible base class for callbacks (i.e. the perf callback's parent), a number of the callback methods aren't empty - they make calls to the v1 callback methods for backwards compatibility.
https://access.redhat.com/articles/3382771
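To spell out the subclassing idea, a hedged sketch (the import path matches where the plugin lives in ansible devel; the hook chaining and the _emit_job_event helper are hypothetical):

```python
from ansible.plugins.callback.cgroup_perf_recap import CallbackModule as PerfCallback

class BaseCallbackModule(PerfCallback):
    # Attribute lookup stops at the subclass, so this shadows the parent's
    # 'aggregate' type and Ansible sees a single 'stdout' callback.
    CALLBACK_TYPE = 'stdout'
    CALLBACK_NAME = 'awx_display'

    def v2_playbook_on_task_start(self, task, is_conditional):
        # The perf plugin never calls super(), so chain to it explicitly
        # before doing runner's own event bookkeeping.
        PerfCallback.v2_playbook_on_task_start(self, task, is_conditional)
        self._emit_job_event('playbook_on_task_start', task)  # hypothetical helper
```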
@sivel - the perf plugin doesn't close out the files with the profiling data until v2_playbook_on_stats is called. Is there any reason why the files couldn't be closed out after the task finishes?
We'd like to be able to collect the profiling information for each task as soon as we generate the job event for runner_on_ok, runner_on_failed, etc. If we try to read from the files before they're closed, though, I'm concerned that we may miss data that hasn't yet been flushed to the file.
@jladdjr that is likely an oversight. The original implementation did not allow for a file per task; that was added later. If it is only closing in _on_stats then that is a bug. It should likely be happening in _profile.
I've opened a PR to address that: https://github.com/ansible/ansible/pull/53961
@sivel - awesome, thx!
@dsesami - quick note that we're planning on adding a toggle on the jobs settings page to enable/disable playbook profiling (CPU, Mem, PID count)
On second thought, think there's a way to just collect the profiling information from the files that get created. That sounds a bit simpler, going to give that a shot.
So far, issues to address:
- "profiling_data": {} showing up in list view; it should only be seen in the detail view
- "profiling_data": {} showing up for events other than runner_on_ok, runner_item_on_ok, runner_on_failed, runner_item_on_failed

@kdelee - have a patch that should get the profiling_data field to show up when we want it and omit it otherwise. Tested locally and confirmed it fixes your first two points ^
https://github.com/jladdjr/awx/commit/5663599d08a8f06d83b930c4fd53462550e48fc5
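The gist of that patch, as a hedged standalone sketch (the real change lives in AWX's serializers; names here are illustrative):

```python
# Only runner events carry profiling data, and only the detail view shows it.
EVENTS_WITH_PROFILING = frozenset((
    'runner_on_ok', 'runner_item_on_ok',
    'runner_on_failed', 'runner_item_on_failed',
))

def filter_profiling_data(event_fields, event_type, is_detail_view):
    if not is_detail_view or event_type not in EVENTS_WITH_PROFILING:
        event_fields.pop('profiling_data', None)
    return event_fields
```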
Discussion with @ryanpetrello around what the use cases for this data are and how to use it yielded these points:
Comparing data from run to run of a playbook is highly subjective: because of the idempotent nature of well-written playbooks running on live infra, tasks that take a long time or use many resources in one run may not in another. Additionally, while CPU + memory usage may not be impacted by external processes, the total time a task takes is affected by the availability of resources (and with it the likelihood that any data will be collected for the task at all).
@ryanpetrello mentioned that core's intent behind this plugin is to catch regressions. But to do this, they would need to run the same playbooks in controlled and homogeneous environments... and our customers’ environments are not homogeneous.
For our users, the most obvious use case we could think of is as a live debugging tool: a user notices that a certain playbook is having trouble and wants insight into which task(s) are causing it.
This is one reason why I think it may be better for there to be two toggles for data collection (see the sketch after this list):
One, at the system level, a sysadmin allows performance data collection.
Two, at the job-template level, a user decides to enable it for a specific job template.
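A minimal sketch of how the two toggles could combine (names are illustrative, not actual AWX settings):

```python
# System-wide gate set by the admin; per-template opt-in set by the JT owner.
SYSTEM_PROFILING_ENABLED = True

def profiling_active(job_template):
    # Hypothetical enable_profiling flag on the job template model.
    return SYSTEM_PROFILING_ENABLED and job_template.enable_profiling
```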
As a user, if I wanted to know "What task from job X took the most memory", here's what I might do today if this feature landed as is:
1) Ask the sys-admin to enable performance stats collection globally <insert bureaucracy> + re-run
2) Since we cannot display these in the list view (because doing so causes problems for AWX), collect all events from the job individually and grab performance_stats. _If I happen to know which individual events we collect data for, then maybe I could filter it down, but that requires a lot of insight into the implementation details._
3) For each event, if it has performance data, get the max or the mean of all the individual measurements (of which there may be none or many)
4) Compare all the data to find my most costly event
5) Try to correlate this back to a task
This would be a slow and taxing operation on my side. An alternative would be to write some custom database queries, but this would require access to the database + an understanding of our database schema -- which is not realistic.
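To make that concrete, here's roughly what the manual workflow above looks like (a hedged sketch; the job_events endpoint is AWX's real list API, but the shape of profiling_data is assumed, since it isn't settled yet):

```python
import requests

def most_memory_hungry_event(host, token, job_id):
    """Walk every page of job events, tracking the one with the highest peak."""
    headers = {'Authorization': 'Bearer {0}'.format(token)}
    url = '{0}/api/v2/jobs/{1}/job_events/'.format(host, job_id)
    worst_event, worst_peak = None, 0
    while url:
        page = requests.get(url, headers=headers).json()
        for event in page['results']:
            data = (event.get('event_data') or {}).get('profiling_data') or {}
            samples = data.get('memory') or []  # assumed shape: list of samples
            peak = max(samples) if samples else 0
            if peak > worst_peak:
                worst_event, worst_peak = event['id'], peak
        nxt = page.get('next')  # relative URL, e.g. '/api/v2/...?page=2'
        url = host + nxt if nxt else None
    return worst_event, worst_peak
```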
Maybe I'm way off, but I don't know how to actually use this data or get at it in a way that would help me do anything, so it is difficult to test because I have no realistic use cases.
Seems like we are rushing to get this in when we still don't have a good idea of how we/users might use the data, and we run the risk of running into more performance problems.
Resounding consensus from QE is that we'd like to have more time to see how it performs on more long-running jobs and jobs with many managed hosts or tasks that use with_items loops (since those seem to behave differently from regular tasks). Additionally, we need to have some better use-cases laid out in order to test in a realistic way.
side note -- working on this feature also surfaced some development tooling we need in order to have an efficient workflow when we need features in ansible-runner to enable awx changes. Namely, building RPMs from ansible-runner branches and nightly builds of master for testing.
Agree with @kdelee. While the feature does surface the data reported from the perf stats plugin (the original scope of the feature), as it stands the customer would have to go to pretty great lengths to actually take the raw data and make it actionable. I also think it's worth spending some more time thinking about how we can take the data that's currently being returned and find a way to make it a) easier to get at and b) easier to make sense of. The perf stats plugin, as written, doesn't take care of that for us.
mentioned that core's intent behind this plugin is to catch regressions
While we have many uses for this, the cgroup_perf_callback was written as a tool to diagnose customer performance issues.
Also, big +1 on note about tooling. Being able to easily source ansible-runner from a branch during standalone / cluster builds would really speed up the development cycle.
Note to future QE:
Should think about the implications for logging integration since the performance stats can generate a lot of lines of measurements for long running tasks, and we've seen evidence that some logging aggregators (Splunk) may have trouble with this.
Jumping on from a product standpoint - the goal in a later release is to find a way to visualize this data such that a user can make use of it, and use that to guide how it's stored in the API.
Redacted notes from meeting:
- api/v2/jobs/<job_id>/performance

@AlanCoding - since it looks like you'll be picking up the API side of this, wanted to mention the branches I had been working on:
https://github.com/jladdjr/awx/tree/perf_stats
https://github.com/jladdjr/ansible-runner/tree/perf_stats
Need more data from UX to find out what data needs to look like
The ask is to show a performance graph for the entire job run. From a UI _implementation_ standpoint the most important things we need given our current scoping and time estimates are:
[1] It takes one (or at the very least a _deterministic_ number of) http requests to get _all_ of the graph data for a job (i.e., the data points themselves aren't paginated)
[2] There are task ids / counters / timestamps associated with the data points in the graph data so that the UI can generate the search queries needed to filter the job events endpoint down to a particular range.
_Alternatively_, if we must paginate for [1], then it seems like the client would need to have some sort of scroll / zoom capability on the graph that generates api requests for a more granular range of events from the performance table based on zoom level, in which case we'd really need to revisit our scoping and time estimates for this.
@kdelee
@AlanCoding
Thoughts? I'm guessing [2] is probably not an issue but I'm wondering about [1] since there's no upper limit on the number of tasks or job events that can belong to a job.
Even if the API did something _really_ fancy and unusual, like evenly sample the performance table to a finite number of data points to send when there's a lot of data, it could still result in some wildly inaccurate graphs for jobs that have a large number of tasks since the client would be interpolating between missing data points that could vary significantly in terms of their values.
Late reply, but [1] is one reason why I'm a strong advocate that the backend must munge this data and provide graph data + fixed granularity via the API
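One hedged sketch of what "graph data + fixed granularity" could mean server-side: bucket the raw samples into a bounded number of points before returning them, so the payload size is independent of job length. Returning a min/mean/max envelope per bucket (rather than a single interpolated sample) would also address the accuracy concern raised above.

```python
def downsample(samples, max_points=500):
    """samples: list of (timestamp, value) pairs, sorted by timestamp."""
    if len(samples) <= max_points:
        return samples
    step = len(samples) / float(max_points)
    points = []
    for i in range(max_points):
        bucket = samples[int(i * step):int((i + 1) * step)]
        values = [v for _, v in bucket]
        # One fixed-width bucket -> one point: (start time, min, mean, max)
        points.append((bucket[0][0], min(values),
                       sum(values) / float(len(values)), max(values)))
    return points
```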
@wenottingham @one-t @jladdjr I know we split this out into multiple issues, but I'm placing this one in test, because https://github.com/ansible/awx/pull/4716 summarizes what we've committed to doing in the near term (and @one-t is actively working on verifying it).
We'll need to revisit the graphs/data collection aspects of this at a later date (especially as we decide how it relates to Automation Insights).