PyTorch by default does not disable the autograd profiler and the other debug tools while training; there are a lot of them, listed in one of the talks here.
To improve training speed, it is recommended to turn them off manually.
Since this feature arguably should be on by default, as it gives a good stack trace for errors, can we provide a utility function to toggle it, as we did for seeding everything?
from utils.profiler import profiler
profiler(state="off")  # to turn off
profiler(state="on")   # to turn on
Something like this will allow ease of toggling the profiler.
Users can already use a context manager in PyTorch (but sadly nobody does):
with torch.autograd.profiler.profile(enabled=False) as prof:
    ...  # PyTorch code
Alternatively, it could be an argument in the Trainer itself, but I guess this is not ideal:
Trainer(profiler="off")
Do verify whether we are following the other guidelines recommended by NVIDIA.
Hi! Thanks for your contribution, great first issue!
Does this also include the grad checking and anomaly detection as listed in the linked presentation? If so, we should think of a different name than "profiler", also for the reason that Trainer already has a profiler argument.
Yes, I guess the PDF says we should disable the following:
Disable Debug APIs for Final Training
- anomaly detection: torch.autograd.detect_anomaly, torch.autograd.set_detect_anomaly(True)
- autograd profiler: torch.autograd.profiler.profile
- automatic NVTX ranges: torch.autograd.profiler.emit_nvtx
- autograd gradcheck: torch.autograd.gradcheck, torch.autograd.gradgradcheck
From the docs for anomaly detection:
This mode should be enabled only for debugging as the different tests will slow down your program execution.
So yes, we need a different name than "profiler". Maybe call it debug API?
@awaelchli I will do a test run with the ResNet50 and ResNet101 models from torchvision, with PyTorch Lightning as well as pure PyTorch, on CIFAR-10. (I can't run on ImageNet; it would take a lot of time and its download is somehow restricted.)
I will take 5 runs of 50 epochs with a fixed batch size of 32 with the profiler, and 5 without the profiler. Let's see if I get any improvement. I will try on a single GPU only.
All other flags, such as cudnn.deterministic and the seeds, I will keep the same (see the sketch below). Let's see if it gives any advantage. As per NVIDIA, we should see a speedup.
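A minimal sketch of the fixed setup I have in mind (the helper name set_reproducible is just for illustration):

import random

import numpy as np
import torch


def set_reproducible(seed: int = 42) -> None:
    # Fix all RNG seeds so the Lightning and pure-PyTorch runs start from the same state.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Force deterministic cuDNN kernels and disable the autotuner,
    # so kernel selection does not differ between runs.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False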
I will update you once done here !
That would be amazing to see these stats!!
Stats after my runs
Model used Resnet50
GPU P100 (Google Colab)
With Profiler
Train batch size = 32
Validation batch size = 8
Time per train step = ~176 seconds
Time per validation step = ~13 seconds
Hence time per epoch = ~190 seconds
Turning off Debug APIs (except the torch.autograd.gradcheck and torch.autograd.gradgradcheck)
Train batch size = 32
Validation batch size = 8
Time per train step = ~105 seconds
Time per validation step = ~13 seconds
Hence time per epoch = ~118 seconds.
GPU memory used was around 1GB, as these are small images and the model is small too.
Run 2
With Profiler
Train batch size = 256
Validation batch size = 32
Time per train step = ~40 seconds
Time per validation step = ~4.7 seconds
Hence time per epoch = ~44.7 seconds
Turning off Debug APIs (except the torch.autograd.gradcheck and torch.autograd.gradgradcheck)
Train batch size = 256
Validation batch size = 32
Time per train step = ~20 seconds
Time per validation step = ~4.5 seconds
Hence time per epoch = ~25 seconds.
GPU memory used was ~4GB (still small).
@awaelchli I hope these results will reproduce when you try to run it.
We might also need to test on other GPUs such as the V100 and T4, and check how it works with DP / DDP and multi-GPU setups.
We might need to test on bigger models such as BERT etc.
I will try EfficientNets (b3-b5, as they fit) if needed.
It appears to give a significant improvement. If this scales to multi-GPU, then it's quite big.
Do let me know your thoughts!
Run 3
Model used EfficientNet b3
GPU P100 (Google Colab)
With Profiler
Train batch size = 32
Validation batch size = 32
Time per train step = ~460 seconds
Time per validation step = ~15 seconds
Hence time per epoch = ~475 seconds
Turning off Debug APIs (except the torch.autograd.gradcheck and torch.autograd.gradgradcheck)
Train batch size = 32
Validation batch size = 32
Time per train step = ~110 seconds
Time per validation step = ~7 seconds
Hence time per epoch = ~117 seconds.
GPU memory used was around 4GB.
An incredible reduction in time; I really cannot understand how it is so much less. Please have a go with the Colab on a local GPU too.
wow. this is great.
I just forgot to mention what I did for this.
Just added these 3 lines:
import torch


def set_debug_apis(state: bool = False):
    torch.autograd.profiler.profile(enabled=state)
    torch.autograd.profiler.emit_nvtx(enabled=state)
    torch.autograd.set_detect_anomaly(mode=state)


# Then in the training code, before the train loop
set_debug_apis(state=False)
@oke-aditya I used your function as you posted it in my research project, and I do not observe any speed up.
Then I looked at the docs and it seems that these functions are actually context managers, so they do nothing if you just use them like this, without using "with". For example, profile is meant to be used like so:
with torch.autograd.profiler.profile():
    y = model(x)
    ...
The "enabled" parameter is there so you can quickly turn it off for this block without the need to comment and unindent the whole code:
with torch.autograd.profiler.profile(enabled=False):
    y = model(x)
    ...
The docs say enabled=True by default, but that just refers to the context manager when it is used. It does not mean the functionality is enabled when you don't use the manager.
Of course, this does not explain your numbers (and I will try to find out why), but it makes more sense now, and these NVIDIA slides are in fact misleading. What they are trying to say is: if you have these context managers in your code, you should disable them for the final training. That's all it is. But most people don't even use them, so they already have the "most performant" code and there is no need to disable anything.
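For illustration, opting in and actually reading the results might look roughly like this (a sketch; the resnet50 model and random input here are just examples, not from the discussion):

import torch
import torchvision

model = torchvision.models.resnet50()
x = torch.randn(8, 3, 224, 224)

# The profiler only records ops executed inside this block.
with torch.autograd.profiler.profile() as prof:
    y = model(x)

# Summarize what was recorded; outside the block nothing is profiled.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))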
Hmmm, benchmarking is quite tricky.
Maybe my results are off due to CUDA caching some operations, fusing them, etc.
Also, the OS usually caches such files, and file I/O time could be one of the reasons for my timings.
I will have another go. But your thoughts are also correct.
These context managers are ON by default. I'm not sure if PyTorch internally uses them. E.g.
y = model(x)  # gradient tracking was active; we don't even realize it, hence it takes time
with torch.no_grad():  # with gradient tracking disabled, the next line is faster
    y = model(x)
Applying the same logic:
with torch.autograd.profiler.profile(enabled=False):
    y = model(x)
    # This might be faster,
    # since the profiler is enabled by default and some accumulation might be happening in the background.
I'm still unsure about the above; the context manager might work slightly differently from no_grad().
Though your thoughts make more sense to me. Maybe I misinterpreted what they mean.
I will do one more run of my Colab, taking care of these things.
It's really hard to figure out whether I got the speedup from exactly these three lines or for some other reason.
I don't think the benchmarking done here is valid (although there may indeed be a big speed up!). You need to manually call torch.cuda.synchronize() or else cuda operations may not finish and just be asynchronously dispatched. See discussion in https://github.com/pytorch/pytorch/issues/8817
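For reference, a hedged sketch of what synchronized timing could look like (model and batch here are placeholders, not from the benchmark code above):

import time

import torch


def timed_forward(model, batch):
    # Wait for any previously queued CUDA work before starting the clock.
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = model(batch)
    # CUDA kernels are dispatched asynchronously, so synchronize again
    # before stopping the clock to measure the real execution time.
    torch.cuda.synchronize()
    return out, time.perf_counter() - start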
Yes @tbenst, it might not be correct. I'm not very experienced with CUDA benchmarking.
Can you have a look at both the Colabs / locally and let me know if there can be any improvements?
If possible please let me know your benchmarking results.
Hi, as can be found here, you can check whether anomaly detection is enabled by running the following few lines of code:
import torch
# It should be disabled by default
print(torch.is_anomaly_enabled()) # prints False
And you could manually set it on and off by using the torch.set_anomaly_enabled method, as shown below:
mode: bool = True  # or False
torch.set_anomaly_enabled(mode)
print(torch.is_anomaly_enabled())  # now prints True
Also, looking here, I noticed you could check if the autograd profiler is enabled or disabled, as shown below:
import torch
# It should be disabled by default
print(torch.autograd._profiler_enabled())  # prints False

with torch.autograd.profiler.profile():
    print(torch.autograd._profiler_enabled())  # prints True

print(torch.autograd._profiler_enabled())  # prints False
Yeah, I guess they are off by default. But just like we have pl.seed_everything, we could have this utility. It would simply make it easier to turn these options on and off.
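Given the findings above, a sketch of what such a utility might look like (a suggestion only; anomaly detection is the one thing with a global switch, while the profilers are opt-in context managers, so there is nothing to turn off unless you wrap your code in them):

import torch


def set_debug_apis(state: bool = False) -> None:
    # Anomaly detection is the only debug API here with a global on/off switch.
    torch.autograd.set_detect_anomaly(state)
    # torch.autograd.profiler.profile and torch.autograd.profiler.emit_nvtx only
    # record inside a `with` block, so "disabling" them just means not entering them.
    # torch.autograd.gradcheck / gradgradcheck only cost time if explicitly called.


# Before the training loop:
set_debug_apis(state=False)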