Awx: Collect Ansible performance stats

Created on 17 Dec 2018 · 27 comments · Source: ansible/awx

ISSUE TYPE
  • Feature Idea
COMPONENT NAME
  • API
SUMMARY

@sivel has a perf callback plugin: https://github.com/ansible/ansible/pull/46346

We should integrate/merge this into ours to collect perf stats.

ADDITIONAL INFORMATION

This is for collection only. Visualization would be other issues/initiatives.

Tasks:

  • [x] Collect perf stats data in Runner.event_callback
  • [x] Installer should create top-level cgroup
  • [x] .. and pass the name of the cgroup to runner
  • [x] perf stats flags should be surfaced in

    • [x] Runner.run method

    • [x] runner CLI flags

    • [x] runner settings file

  • [x] Create system setting for enabling perf stats (then notify @kialam)
  • [x] Stats should show up in events from jobs that run on isolated nodes
  • [x] ansible-runner unit / integration tests

    • [ ] .. configured to only run if using ansible 2.8 and able to access top-level cgroup

  • [x] Env vars set for perf stats should show up in JOB_ENV
  • [ ] Create dev docs in runner / awx
  • [ ] Once PR is up, sync up with @tvo318 to confirm cgcreate command and any other feature docs needed
  • [ ] Once we land on final name for job event field for stats, sync up with @rooftopcellist re: metrics
Labels: api, medium, needs_test, enhancement

All 27 comments

@matburt you mentioned that we would probably subclass the existing AWX callback plugin from this new perf plugin. Alternatively, I think it makes more sense to not inherit one from the other, but just have both enabled. This should involve no changes to the existing callback plugin. One more reason for this is that they are both different plugin types, the perf callback module has CALLBACK_TYPE = 'aggregate', whereas the AWX (and now ansible-runner) callback module is the 'stdout' type.

We specify the standard callback module via ANSIBLE_STDOUT_CALLBACK. I think the perf callback plugin would be enabled via ANSIBLE_CALLBACK_WHITELIST, which is a list.

https://github.com/ansible/ansible/blob/013c42b14fa06ff6355d8a9d52430f4cb1262047/lib/ansible/config/base.yml#L496-L506
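For reference, a minimal sketch (not taken from the actual implementation) of what enabling both plugins via environment variables could look like when launching a run through ansible-runner; the stdout callback name awx_display and the plugin name cgroup_perf_recap are assumptions about how the pieces get wired together.

```python
# Minimal sketch: select the stdout callback and whitelist the aggregate-type
# perf callback via environment variables passed through ansible-runner.
import ansible_runner

result = ansible_runner.run(
    private_data_dir='/tmp/demo',   # hypothetical project directory
    playbook='site.yml',            # hypothetical playbook
    envvars={
        # the single stdout plugin
        'ANSIBLE_STDOUT_CALLBACK': 'awx_display',
        # comma-separated list of additional (non-stdout) plugins to load
        'ANSIBLE_CALLBACK_WHITELIST': 'cgroup_perf_recap',
    },
)
print(result.status)
```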

Incidentally, this is the same thing I had in mind when I filed https://github.com/ansible/awx/issues/41, it would have similarly been a stats gathering thing.

The problem I was worried about is that we need to be able to get the data in our plugin in order to send it along with the rest of the event data.

@wenottingham @jladdjr and I met today.
One question that came up is:

  • How does this get turned on/off? What granularity? On job launch, job template, system wide?
  • @one-t also pointed out this could cause significant database bloat if left on all the time, as it may double the size of each job event
  • As a user, I would mostly want to see this if I were actively developing or debugging a playbook. It would create a lot of noise that would not be desirable to see on every run.

    @kialam looks like we will need to surface this option in the UI somehow

Initial idea was a global (system-wide).

I could see it at a per-JT level, but it's not required.

Talking with @chrismeyersfsu :

1) Setting should be system wide
2) Stats should not show with each task, and this would be extra work to do. "It should just do the right thing"
3) We might need to do UI work to make summary at bottom "pretty"
4) Not concerned about growth in job event size for DB

The recap output by the cgroup_perf_recap plugin can be disabled. I'd imagine if the data is being collected with write_files, and displayed in some nice visual way, that the recap wouldn't be necessary.

Setting should be system wide

👍

Stats should not show with each task, and this would be extra work to do. "It should just do the right thing"

@chrismeyersfsu - @matburt and I synced up about this earlier and he had asked for per-task profiling information in the job events. Can you say more about your thoughts on having the stats for the entire playbook run?

"It should just do the right thing"

As written, the perf callback plugin writes the profiling information to a set of files, but doesn't put any information in job events. Here's how I'm currently handling that:
https://github.com/jladdjr/ansible-runner/blob/perf_stats/ansible_runner/callbacks/cgroup_perf_recap.py#L48-L55
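In rough terms, the idea there is to read whatever per-task files the perf plugin wrote and merge their contents into the event payload. A hypothetical helper along those lines (the file naming and JSON format here are assumptions, not the plugin's actual layout):

```python
# Hypothetical helper: after a task finishes, pick up any profiling files the
# perf callback wrote for that task and attach their contents to the outgoing
# job event. File naming and format are assumptions for illustration.
import glob
import json
import os


def attach_profiling_data(event_data, output_dir, task_uuid):
    """Merge per-task profiling output (if any) into the event dict."""
    profiling = {}
    for path in glob.glob(os.path.join(output_dir, '*{0}*.json'.format(task_uuid))):
        feature = os.path.basename(path).split('-')[0]  # e.g. 'memory', 'cpu', 'pids'
        with open(path) as f:
            profiling[feature] = json.load(f)
    if profiling:
        event_data['profiling_data'] = profiling
    return event_data
```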

We might need to do UI work to make summary at bottom "pretty"

We ran into this same kind of question with the new playbook host summary fields and ultimately decided to just have parity with what users would see when they ran using ansible itself.
https://github.com/ansible/awx/pull/3272#issuecomment-466619365

Update:
(tl;dr): the way I adapted the perf callback plugin (having it extend runner's BaseCallbackModule) is leading to duplicated job events. AWX's serializers only expect one copy of the job events (e.g. here), so that won't work. I think that instead I should go with what was initially discussed on the ticket and have BaseCallbackModule sub-class the perf plugin. The only issue is that the perf plugin's callback methods don't make a call to super. As discussed below, though, this may not be a problem. Going to tinker a bit and see if this works.


@matburt - The way I have it written now, the perf callback plugin is extending runner's BaseCallbackModule. (See https://github.com/jladdjr/ansible-runner/blob/perf_stats/ansible_runner/callbacks/cgroup_perf_recap.py#L38). But what I'm finding is that whenever a BaseCallbackModule object's callback methods get called, a new job event is generated. (For some reason I thought that the job events created by the awx_display callback would be merged with the ones created by the new callback I created, but instead it creates an entirely new set of job events).

At this point, I think I've come full circle and think it makes sense to do what it sounds like you and @alancoding originally suggested: ".. you mentioned that we would probably subclass the existing AWX callback plugin from this new perf plugin". That would prevent any extra job events from being created.

@alancoding brought up some good points about how the perf plugin is a different callback type than runner's BaseCallbackModule. I suspect that both plugins would still work fine if one subclassed the other, though; BaseCallbackModule would just override CALLBACK_TYPE, setting it to 'stdout', and that's the only value Ansible would end up seeing when it initialized the callback.

The only problem I currently see with the idea of sub-classing the perf plugin is that the perf plugin's callback methods don't call the parent's methods (i.e. no calls to super):
https://github.com/ansible/ansible/blob/devel/lib/ansible/plugins/callback/cgroup_perf_recap.py#L394-L398

If you look at the Ansible base class for callbacks (i.e. the perf callback's parent), a number of the callback methods aren't empty - they make calls to the v1 callback methods for backwards compatibility.
https://access.redhat.com/articles/3382771
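To make the sub-classing idea concrete, here is a rough sketch of the inheritance being discussed; the import path, class names, and the _emit_job_event hook are assumptions for illustration only, and the method shown is just one example.

```python
# Rough sketch: runner's stdout callback subclassing the perf plugin.
from ansible.plugins.callback.cgroup_perf_recap import CallbackModule as PerfCallback


class CallbackModule(PerfCallback):
    CALLBACK_VERSION = 2.0
    CALLBACK_TYPE = 'stdout'      # overrides the parent's 'aggregate' type
    CALLBACK_NAME = 'awx_display'

    def v2_playbook_on_task_start(self, task, is_conditional):
        # The perf plugin's callback methods do not call super(), so the
        # subclass has to invoke the parent explicitly if it wants the
        # profiling behaviour *and* its own event handling.
        super(CallbackModule, self).v2_playbook_on_task_start(task, is_conditional)
        self._emit_job_event(task)  # hypothetical event-emitting hook
```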

@sivel - the perf plugin doesn't close out the files with the profiling data until v2_playbook_on_stats is called. Is there any reason why the files couldn't be closed out after the task finishes?

We'd like to be able to collect the profiling information for each task as soon as we generate the job event for runner_on_ok, runner_on_failed, etc. If we try to read from the files before they're closed though, I'm concerned that we may lose some data that isn't written to the file.

@jladdjr that is likely an oversight. The original implementation did not allow for a file per task, that was added later. If it is only closing in _on_stats then that is a bug. It should likely be happening in _profile

I've opened a PR to address that: https://github.com/ansible/ansible/pull/53961

@sivel - awesome, thx!

@dsesami - quick note that we're planning on adding a toggle on the jobs settings page to enable/disable playbook profiling (CPU, Mem, PID count)

On second thought, think there's a way to just collect the profiling information from the files that get created. That sounds a bit simpler, going to give that a shot.

So far issues to address:

  • [ ] "profiling_data": {}, showing up in list view, should only see it in detail view
  • [ ] "profiling_data": {}, showing up for events other than runner_on_ok, runner_item_on_ok, runner_on_failed, runner_item_on_failed
  • [ ] examine performance on jobs that run longer than 10 min + compare to w/o performance stats
  • [ ] examine performance with many jobs running in parallel for more than 10 min (my gut feeling is this is better as a per-job-template setting -- not a system setting. or to make it even more fine grained, a system and a job setting. System allows it to be enabled. Job template setting actually collects metrics on that job template)

@kdelee - have a patch that should get the profiling_data field to show up when we want to and omit it otherwise. Tested locally and confirmed it fixes your first two points ^
https://github.com/jladdjr/awx/commit/5663599d08a8f06d83b930c4fd53462550e48fc5
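For anyone curious, the general shape of that kind of fix in Django REST Framework looks roughly like the mixin below, mixed into the job event serializer. This is a generic sketch, not necessarily how the linked commit implements it, and the detail-view check in particular is an assumption about how the view exposes itself.

```python
# Sketch: drop profiling_data from list views and from events that never carry it.
PROFILED_EVENTS = ('runner_on_ok', 'runner_item_on_ok',
                   'runner_on_failed', 'runner_item_on_failed')


class ProfilingDataMixin(object):

    def to_representation(self, instance):
        ret = super(ProfilingDataMixin, self).to_representation(instance)
        is_detail = getattr(self.context.get('view'), 'action', None) == 'retrieve'
        if not is_detail or getattr(instance, 'event', None) not in PROFILED_EVENTS:
            ret.pop('profiling_data', None)
        return ret
```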

Discussion with @ryanpetrello around what the use cases for this data are and how to use it yielded these points:

Comparing data from run to run of a playbook is highly subjective: because of the idempotent nature of (well-)written playbooks running on live infrastructure, tasks that take a long time or use many resources in one run may not in another. Additionally, while CPU and memory usage may not be impacted by external processes, the total time a task takes is affected by the availability of resources (and with it the likelihood that any data is collected for the task at all).

@ryanpetrello mentioned that core's intent behind this plugin is to catch regressions. But to do this, they would need to run the same playbooks in controlled and homogeneous environments... and our customers' environments are not homogeneous.

For our users, the most obvious use case we could think of for this data is as a live debugging tool when a user notices that a certain playbook is having trouble and wants insight into which task(s) are causing it.

This is one reason why I think it may be better to have two toggles for data collection:
One, at the system level, a sysadmin allows performance data collection.
Two, at the job-template level, a user enables it for a specific job template.

As a user, if I wanted to know "What task from job X took the most memory", here's what I might do today if this feature landed as is:
1) Ask a sysadmin to enable performance stats collection globally <insert bureaucracy> and re-run the job.
2) Since we cannot display these in the list view (because that causes problems for AWX), collect all events from the job individually and grab performance_stats. _If I happen to know which individual events we collect data for, then maybe I could filter it down, but that requires a lot of insight into the implementation details._
3) For each event that has performance data, take the max or the mean of all the individual measurements (of which there may be none or many).
4) Compare all the data to find my most costly event.
5) Try to correlate this back to a task.

This would be a slow and taxing operation on my side. An alternative would be to write some custom database queries, but that would require access to the database plus an understanding of our database schema -- which is not realistic.
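To illustrate how clunky that is, roughly what the manual aggregation might look like against the API today (illustrative only: the location and shape of profiling_data inside the event are assumptions, and the host/credentials are placeholders):

```python
# Hypothetical "max memory by task" aggregation over the paginated job events API.
import requests

AWX_URL = 'https://awx.example.com'   # placeholder host
AUTH = ('admin', 'password')          # placeholder credentials


def max_memory_by_task(job_id):
    peaks = {}
    url = '{0}/api/v2/jobs/{1}/job_events/'.format(AWX_URL, job_id)
    while url:
        page = requests.get(url, auth=AUTH, verify=False).json()  # verify=False: dev-only, self-signed cert
        for event in page['results']:
            # assumed location/shape of the profiling data on the event
            samples = event.get('event_data', {}).get('profiling_data', {}).get('memory', [])
            if samples:
                peaks[event['task']] = max(peaks.get(event['task'], 0), max(samples))
        nxt = page.get('next')
        url = (AWX_URL + nxt) if nxt and nxt.startswith('/') else nxt
    return sorted(peaks.items(), key=lambda kv: kv[1], reverse=True)
```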

Maybe I'm way off, but I don't know how to actually use this data/get at it in a way that would help me do anything, thus it is difficult to test because I have no realistic use cases.

Seems like we are rushing to get this in when we still don't have a good idea of how we/users might use the data, and we run the risk of running into more performance problems.

Resounding consensus from QE is that we'd like to have more time to see how it performs on more long-running jobs and jobs with many managed hosts/tasks with with_items loops (since they seem to behave differently than regular tasks). Additionally, we need to have some better use-cases laid out in order to test in a realistic way.

side note -- working on this feature also surfaced some development tooling we need in order to have efficient workflow when we need features in ansible-runner to enable awx changes. Namely, building RPMs from ansible-runner branches and nightly builds of master for testing.

Agree with @kdelee. While the feature does surface the data reported from the perf stats plugin (the original scope of the feature), as it stands the customer would have to go to pretty great lengths to actually take the raw data and make it actionable. I also think it's worth spending some more time thinking about how we can take the data that's currently being returned and make it a) easier to get at and b) easier to make sense of. The perf stats plugin, as written, doesn't take care of that for us.

mentioned that core's intent behind this plugin is to catch regressions

While we have many uses for this, the cgroup_perf_callback was written as a tool to diagnose customer performance issues.

Also, big +1 on note about tooling. Being able to easily source ansible-runner from a branch during standalone / cluster builds would really speed up the development cycle.

Note to future QE:
Should think about the implications for logging integration since the performance stats can generate a lot of lines of measurements for long running tasks, and we've seen evidence that some logging aggregators (Splunk) may have trouble with this.

Jumping on from a product standpoint - the goal in a later release is to find a way to visualize this data such that a user can make use of it, and use that to guide how it's stored in the API.

Redacted notes from meeting:

  • @matburt 's new idea:

    • Performance data stored separately from event data, so we can aggregate it

    • Then we can generate average + highwater mark

    • Separate table, not on job event table

    • If we do this, must deal with it like fact cache data

    • @ryanpetrello's concern: we would not get the distributed processing of this data that we get for free with the callback receiver

    • Ryan's counter-offer is to pop the data off the job event and store it in a separate table (see the sketch after these notes).

    • Questions about relationship between job id/job event id and performance data table

    • Do not show data in job event api

  • Matt has a question about whether we get the performance data for a task only after the task completes

    • E.g. on task 2 you get the data for task 1

    • Unclear what answer is

  • Do we want System Level “Allow Performance Data Collection” true/false ? UNANSWERED
  • Do we want JT level toggle?

    • YES

    • Maybe even just on individual job launch prompt level

  • How does this get shown?

    • This is more about historical data

    • Not real time

    • During job run, we don’t have data

    • After job completion, would have new url hanging off api/v2/jobs/<job_id>/performance

    • Would return data

    • Able to correlate peaks on graph with task? UNANSWERED

    • Need more data from UX to find out what data needs to look like

    • Who gets to see this? --> anyone who can see the job details page
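A hypothetical Django model for the "separate table" idea above (Ryan's counter-offer); field names and the relationship back to jobs are assumptions, not a committed schema. Average and high-water-mark rollups would then be plain ORM aggregates over this table.

```python
from django.db import models


class JobPerformanceSample(models.Model):
    """One poll-interval measurement, stored outside the job event table."""

    job = models.ForeignKey('main.Job', related_name='performance_samples',
                            on_delete=models.CASCADE)
    task_uuid = models.CharField(max_length=1024, db_index=True)  # correlate back to a task
    captured = models.DateTimeField(db_index=True)
    cpu = models.FloatField(null=True)          # percent of a core
    memory = models.BigIntegerField(null=True)  # bytes
    pids = models.IntegerField(null=True)

    class Meta:
        app_label = 'main'
        ordering = ('captured',)

# e.g. JobPerformanceSample.objects.filter(job=some_job).aggregate(
#          models.Avg('memory'), models.Max('memory'))
```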

@AlanCoding - since it looks like you'll be picking up the API side of this, wanted to mention the branches I had been working on:
https://github.com/jladdjr/awx/tree/perf_stats
https://github.com/jladdjr/ansible-runner/tree/perf_stats

Need more data from UX to find out what data needs to look like

The ask is to show a performance graph for the entire job run. From a UI _implementation_ standpoint the most important things we need given our current scoping and time estimates are:

[1] It takes one (or at the very least a _deterministic_ number of) http requests to get _all_ of the graph data for a job (i.e: the data points themselves aren't paginated)
[2] There are task ids / counters / timestamps associated with the data points in the graph data so that the UI can generate the search queries needed to filter the job events endpoint down to a particular range.

_Alternatively_, if we must paginate for [1], then it seems like the client would need to have some sort of scroll / zoom capability on the graph that generates api requests for a more granular range of events from the performance table based on zoom level, in which case we'd really need to revisit our scoping and time estimates for this.

@kdelee
@AlanCoding
Thoughts? I'm guessing [2] is probably not an issue but I'm wondering about [1] since there's no upper limit on the number of tasks or job events that can belong to a job.
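To make [1] and [2] concrete, a hypothetical unpaginated response for a per-job performance endpoint might look like the dict below; every data point carries a counter/timestamp/task identifier the UI could turn into job-event search queries. All field names here are assumptions.

```python
example_response = {
    'job': 1234,
    'granularity_seconds': 15,   # fixed bucket width, so the series size is bounded
    'series': [
        {
            'timestamp': '2019-06-01T12:00:00Z',
            'counter': 87,            # job event counter nearest this sample
            'task_uuid': 'f3c9...',   # truncated for the example
            'cpu': 42.0,              # percent of a core
            'memory': 104857600,      # bytes
            'pids': 7,
        },
        # ... one entry per bucket for the whole job, never paginated
    ],
}
```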

Even if the API did something _really_ fancy and unusual, like evenly sample the performance table to a finite number of data points to send when there's a lot of data, it could still result in some wildly inaccurate graphs for jobs that have a large number of tasks since the client would be interpolating between missing data points that could vary significantly in terms of their values.

Late reply, but [1] is one reason why I'm a strong advocate that the backend must munge this data and provide graph data + fixed granularity via the API
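A sketch of what "fixed granularity" could mean on the backend: bucket the raw samples into a bounded number of time slots and report the mean and high-water mark per bucket, so the response size stays constant no matter how long the job ran. This is the idea only, not an agreed-on implementation.

```python
def downsample(samples, buckets=500):
    """samples: sorted list of (timestamp_seconds, value); returns one point per non-empty bucket."""
    if not samples:
        return []
    start, end = samples[0][0], samples[-1][0]
    width = max((end - start) / float(buckets), 1e-9)  # avoid zero-width buckets
    grouped = {}
    for ts, value in samples:
        idx = min(int((ts - start) / width), buckets - 1)
        grouped.setdefault(idx, []).append(value)
    return [
        {'bucket': i, 'mean': sum(vals) / len(vals), 'peak': max(vals)}
        for i, vals in sorted(grouped.items())
    ]
```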

@wenottingham @one-t @jladdjr I know we split this out into multiple issues, but I'm placing this one in test, because https://github.com/ansible/awx/pull/4716 summarizes what we've committed to doing in the near term (and @one-t is actively working on verifying it).

We'll need to revisit the graphs/data collection aspects of this at a later date (especially as we decide how it relates to Automation Insights).
