PyTorch by default does not disable the autograd profiler and the other debug tools while training; there are a lot of them, listed in one of the talks here.
To improve training speed, it is recommended to turn them off manually.
Since this feature arguably should be on by default, as it gives a good stack trace for errors, can we provide a utility function to toggle it, as we did for seeding everything?
from utils.profiler import profiler
profiler(state="off")  # to turn off
profiler(state="on")   # to turn on
Something like this will allow ease of toggling the profiler.
Users can already use a context manager in PyTorch (but sadly nobody does):
with torch.autograd.profiler.profile(enabled=False) as prof:
    ...  # PyTorch code
Alternatively, it could be an argument in the Trainer itself, but I guess this is not ideal:
Trainer(profiler="off")
Do verify whether we are following the other guidelines recommended by NVIDIA.
Hi! Thanks for your contribution, great first issue!
Does this also include the grad checking and anomaly detection as listed in the linked presentation? If so, we should think of a different name than "profiler", also for the reason that Trainer already has a profiler argument.
Yes, I guess the PDF says we should disable the following:
Disable Debug APIs for Final Training
- anomaly detection: torch.autograd.detect_anomaly, torch.autograd.set_detect_anomaly(True)
- autograd profiler: torch.autograd.profiler.profile
- automatic NVTX ranges: torch.autograd.profiler.emit_nvtx
- autograd gradcheck: torch.autograd.gradcheck, torch.autograd.gradgradcheck
From the docs for anomaly detection:
This mode should be enabled only for debugging as the different tests will slow down your program execution.
So yes, we need a different name than "profiler". Maybe call it debug API?
@awaelchli I will do a test run with the ResNet50 and ResNet101 models from torchvision, with PyTorch Lightning as well as pure PyTorch, on CIFAR-10. (I can't run on ImageNet; it would take a lot of time and its download is somehow restricted.)
I will take 5 runs of 50 epochs with a fixed batch size of 32 with the profiler, and 5 without the profiler. Let's see if I get any improvement. I will try on a single GPU only.
All other flags, such as cudnn.deterministic and the seeds, I will keep the same (see the sketch below). Let's see if it gives any advantage. As per NVIDIA, we should see a speedup.
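A minimal sketch of the fixed setup I have in mind (the helper name set_reproducible is just for illustration):

import random

import numpy as np
import torch


def set_reproducible(seed: int = 42) -> None:
    # Fix all RNG seeds so the Lightning and pure-PyTorch runs start from the same state.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Force deterministic cuDNN kernels and disable the autotuner,
    # so kernel selection does not differ between runs.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False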
I will update you once done here !
That would be amazing to see these stats!!
Stats after my runs
Model used Resnet50
GPU P100 (Google Colab)
With Profiler
Train batch size = 32
Validation batch size = 8
Time per train step = ~176 seconds
Time per validation step = ~13 seconds
Hence time per epoch = ~190 seconds
Turning off Debug APIs (except the torch.autograd.gradcheck and torch.autograd.gradgradcheck)
Train batch size = 32
Validation batch size = 8
Time per train step = ~105 seconds
Time per validation step = ~13 seconds
Hence time per epoch = ~118 seconds.
GPU memory used was around 1GB, as these are small images and the model is small too.
Run 2
With Profiler
Train batch size = 256
Validation batch size = 32
Time per train step = ~40 seconds
Time per validation step = ~4.7 seconds
Hence time per epoch = ~44.7 seconds
Turning off Debug APIs (except the torch.autograd.gradcheck and torch.autograd.gradgradcheck)
Train batch size = 256
Validation batch size = 32
Time per train step = ~20 seconds
Time per validation step = ~4.5 seconds
Hence time per epoch = ~25 seconds.
GPU memory used was ~4GB (still small).
@awaelchli I hope these results will reproduce when you try to run it.
We might also need to test on other GPUs such as the V100 and T4, and check how it works with DP / DDP and multi-GPU setups.
We might need to test on bigger models such as BERT etc.
I will try EfficientNets (b3-b5, as they fit) if needed.
It appears to give a significant improvement. If this scales to multi-GPU, then it's quite big.
Do let me know your thoughts!
Run 3
Model used EfficientNet b3
GPU P100 (Google Colab)
With Profiler
Train batch size = 32
Validation batch size = 32
Time per train step = ~460 seconds
Time per validation step = ~15 seconds
Hence time per epoch = ~475 seconds
Turning off Debug APIs (except the torch.autograd.gradcheck and torch.autograd.gradgradcheck)
Train batch size = 32
Validation batch size = 32
Time per train step = ~110 seconds
Time per validation step = ~7 seconds
Hence time per epoch = ~117 seconds.
GPU memory used was around 4GB.
An incredible reduction in time; I really cannot understand how it is so much less. Please have a go with the Colab on a local GPU too.
wow. this is great.
I just forgot to mention what I did for this.
Just added these 3 lines:
import torch


def set_debug_apis(state: bool = False):
    torch.autograd.profiler.profile(enabled=state)
    torch.autograd.profiler.emit_nvtx(enabled=state)
    torch.autograd.set_detect_anomaly(mode=state)


# Then in the training code, before the train loop
set_debug_apis(state=False)
@oke-aditya I used your function as you posted it in my research project, and I do not observe any speed up.
Then I looked at the docs and it seems that these functions are actually context managers, so they do nothing if you just use them like this, without using "with". For example, profile is meant to be used like so:
with torch.autograd.profiler.profile():
    y = model(x)
    ...
The "enabled" parameter is there so you can quickly turn it off for this block without the need to comment and unindent the whole code:
with torch.autograd.profiler.profile(enabled=False):
    y = model(x)
    ...
The docs say enabled=True by default, but that just refers to the context manager when it is used. It does not mean the functionality is enabled when you don't use the manager.
Of course, this does not explain your numbers (and I will try to find out why), but it makes more sense now, and these NVIDIA slides are in fact misleading. What they are trying to say is: if you have these context managers in your code, you should disable them for the final training. That's all it is. But most people don't even use them, so they already have the "most performant" code and there is no need to disable anything.
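For illustration, opting in and actually reading the results might look roughly like this (a sketch; the resnet50 model and random input here are just examples, not from the discussion):

import torch
import torchvision

model = torchvision.models.resnet50()
x = torch.randn(8, 3, 224, 224)

# The profiler only records ops executed inside this block.
with torch.autograd.profiler.profile() as prof:
    y = model(x)

# Summarize what was recorded; outside the block nothing is profiled.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))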
Hmmm, benchmarking is quite tricky.
Maybe my results are off due to CUDA caching some operations, fusing them, etc.
Also, the OS usually caches such files, and file I/O time could be one of the reasons for my timings.
I will have another go. But your thoughts are also correct.
These context managers are ON by default. I'm not sure if PyTorch internally uses them. E.g.
y = model(x)  # gradient tracking was active; we don't even realize it, hence it takes time
with torch.no_grad():  # with gradient tracking disabled, the next line is faster
    y = model(x)
Applying the same logic:
with torch.autograd.profiler.profile(enabled=False):
    y = model(x)
    # This might be faster,
    # since the profiler is enabled by default and some accumulation might be happening in the background.
I'm still unsure about the above; the context manager might work slightly differently from no_grad().
Though your thoughts make more sense to me. Maybe I misinterpreted what they mean.
I will do one more run of my Colab, taking care of these things.
It's really hard to figure out whether I got the speedup from exactly these three lines or for some other reason.
I don't think the benchmarking done here is valid (although there may indeed be a big speed up!). You need to manually call torch.cuda.synchronize() or else cuda operations may not finish and just be asynchronously dispatched. See discussion in https://github.com/pytorch/pytorch/issues/8817
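For reference, a hedged sketch of what synchronized timing could look like (model and batch here are placeholders, not from the benchmark code above):

import time

import torch


def timed_forward(model, batch):
    # Wait for any previously queued CUDA work before starting the clock.
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = model(batch)
    # CUDA kernels are dispatched asynchronously, so synchronize again
    # before stopping the clock to measure the real execution time.
    torch.cuda.synchronize()
    return out, time.perf_counter() - start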
Yes @tbenst, it might not be correct. I'm not very experienced with CUDA benchmarking.
Can you have a look at both the Colabs / locally and let me know if there can be any improvements?
If possible please let me know your benchmarking results.
Hi, as can be found here, you can check whether anomaly detection is enabled by running the following few lines of code:
import torch
# It should be disabled by default
print(torch.is_anomaly_enabled()) # prints False
And you could manually set it on and off by using the torch.set_anomaly_enabled method, as shown below:
mode: bool = True  # or False
torch.set_anomaly_enabled(mode)
print(torch.is_anomaly_enabled())  # now prints True
Also, looking here, I noticed you could check if the autograd profiler is enabled or disabled, as shown below:
import torch
# It should be disabled by default
print(torch.autograd._profiler_enabled())  # prints False

with torch.autograd.profiler.profile():
    print(torch.autograd._profiler_enabled())  # prints True

print(torch.autograd._profiler_enabled())  # prints False
Yeah, I guess they are off by default. But just like we have pl.seed_everything, we could have this utility. It would simply make it easier to turn these options on and off.
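Given the findings above, a sketch of what such a utility might look like (a suggestion only; anomaly detection is the one thing with a global switch, while the profilers are opt-in context managers, so there is nothing to turn off unless you wrap your code in them):

import torch


def set_debug_apis(state: bool = False) -> None:
    # Anomaly detection is the only debug API here with a global on/off switch.
    torch.autograd.set_detect_anomaly(state)
    # torch.autograd.profiler.profile and torch.autograd.profiler.emit_nvtx only
    # record inside a `with` block, so "disabling" them just means not entering them.
    # torch.autograd.gradcheck / gradgradcheck only cost time if explicitly called.


# Before the training loop:
set_debug_apis(state=False)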