Ignite: Improve or New time profiling utility

Created on 3 Oct 2020  路  10Comments  路  Source: pytorch/ignite

馃殌 Feature

Currently, ignite provides only a basic time profiling tool : https://pytorch.org/ignite/master/contrib/handlers.html#ignite.contrib.handlers.time_profilers.BasicTimeProfiler

Which can print something like

----------------------------------------------------
| Time profiling stats (in seconds):                 |
 ----------------------------------------------------
total  |  min/index  |  max/index  |  mean  |  std

Processing function:
157.46292 | 0.01452/1501 | 0.26905/0 | 0.07730 | 0.01258

Dataflow:
6.11384 | 0.00008/1935 | 0.28461/1551 | 0.00300 | 0.02693

Event handlers:
2.82721

- Events.STARTED: []
0.00000

- Events.EPOCH_STARTED: []
0.00006 | 0.00000/0 | 0.00000/17 | 0.00000 | 0.00000

- Events.ITERATION_STARTED: ['PiecewiseLinear']
0.03482 | 0.00001/188 | 0.00018/679 | 0.00002 | 0.00001

- Events.ITERATION_COMPLETED: ['TerminateOnNan']
0.20037 | 0.00006/866 | 0.00089/1943 | 0.00010 | 0.00003

- Events.EPOCH_COMPLETED: ['empty_cuda_cache', 'training.<locals>.log_elapsed_time', ]
2.57860 | 0.11529/0 | 0.14977/13 | 0.12893 | 0.00790

- Events.COMPLETED: []
not yet triggered

What would be nice to have:

  • more compact representation (maybe inspire from https://pytorch.org/docs/stable/autograd.html#torch.autograd.profiler.profile)
  • profile handlers and not events. As we are looking for bottlenecks it would be helpful to know which functions take time and not on which event.
  • better representation of data prep time vs training time
enhancement help wanted

All 10 comments

Hi @vfdev-5, I would love to work on this. Could you please assign it to me?

@harsh8398 sure ! Thanks !

Hello @vfdev-5, Sorry for lagging on this. So, I have some questions:

profile handlers and not events. As we are looking for bottlenecks it would be helpful to know which functions take time and not on which event.

We want to add profiling for event handlers attached to engine/events instead of events? Essentially this would be break-down of time taken by handlers attached to events if I understand correctly.

better representation of data prep time vs training time

Is this something we can achieve based on events? For example, aggregate diff between GET_BATCH_STARTED and GET_BATCH_COMPLETED?

Hi @harsh8398

We want to add profiling for event handlers attached to engine/events instead of events? Essentially this would be break-down of time taken by handlers attached to events if I understand correctly.

Yes, something like that. From the user's point of view, showing

- Events.EPOCH_COMPLETED: ['empty_cuda_cache', 'training.<locals>.log_elapsed_time', ]
2.57860 | 0.11529/0 | 0.14977/13 | 0.12893 | 0.00790

that handlers like empty_cuda_cache and training.<locals>.log_elapsed_time have taken together 0.12893 seconds on average does not tell who takes that time. As we can have one slow and another fast.
Naively, i'd list all handlers for all events and measure time for the handlers and display a table with handlers and time. Maybe something like that (but properly aligned):

-------
| name | total (s) | mean (s) | std(s) |
empty_cuda_cache (EPOCH_COMPLETED)                                |   2.57860 | 0.12893 | 0.00790
training.<locals>.log_elapsed_time (EPOCH_COMPLETED)      |  2.57860 |  0.12893 | 0.00790
-------

Is this something we can achieve based on events? For example, aggregate diff between GET_BATCH_STARTED and GET_BATCH_COMPLETED?

Either like that or keep the previous way to measure. Pick the one you think the simplest.

I'm done with profiler implementation. The output looks as follows:
image

You can review the changes and suggest if any modifications required but there's still some TODOs left before raising the PR. Which are as follows:

  • separate rows for Total, Dataflow and Processing stats
  • type hints
  • update write_results() as per new profiler stats format
  • update tests

I have one question, should we include handler times for GET_BATCH_STARTED and GET_BATCH_COMPLETED for calculating dataflow time? Previous implementation had this but I'm not sure if it was intentional.

@harsh8398 table looks very cool 馃憤 Thanks a lot for working on that !
If you'd like to send a draft, please feel free to do that.

I'm just thinking about BC compatibility and maybe it would be better to keep 2 classes and name your version as TimeProfiler or HandlersTimeProfiler or another better name.

I have one question, should we include handler times for GET_BATCH_STARTED and GET_BATCH_COMPLETED for calculating dataflow time? Previous implementation had this but I'm not sure if it was intentional.

Probably, it would be cool ! And even (in another PR) if we could also handle customly added events and measure the time in attached handlers.

@vfdev-5 Okay then for BC it seems the best option is to add a new class as HandlersTimeProfiler. I'll do that.

For handling custom events, its a bit tricky because like epoch and iteration event handlers we don't know how many times they will get triggered. We can use something like this (https://discuss.pytorch.org/t/dynamically-extend-the-tensor/39553/2) for dynamically extending tensor whenever the handler is called. But I think it would get called so many times and it may consume time.

@vfdev-5 Okay then for BC it seems the best option is to add a new class as HandlersTimeProfiler. I'll do that.

@harsh8398 sounds good !

For handling custom events, its a bit tricky because like epoch and iteration event handlers we don't know how many times they will get triggered. We can use something like this (https://discuss.pytorch.org/t/dynamically-extend-the-tensor/39553/2) for dynamically extending tensor whenever the handler is called. But I think it would get called so many times and it may consume time.

Let's see this a bit later. I think handling custom events should not be too different from original events. Engine is aware of all events it could trigger and which handlers...
If you could open a follow-up issue on that, it would be perfect :)

@vfdev-5 raised the draft PR (https://github.com/pytorch/ignite/pull/1398). Let me know if any suggestions or changes needed.

@harsh8398 thanks, I'll try to take a look today.

Was this page helpful?
0 / 5 - 0 ratings