PyTorch Lightning: Adding NVIDIA-SMI-like information

Created on 4 Jun 2020 · 9 comments · Source: PyTorchLightning/pytorch-lightning

🚀 Feature

  • Add GPU usage information during training.

Motivation

Most of the research is done on HPC systems. Therefore, if I want to see the GPU RAM and utilisation of my job, I have to open a secondary screen to run "watch nvidia-smi" or "nvidia-smi dmon".
Having this info saved in the logs would help to:

  1. See if I have space for larger batches
  2. Report the correct resources needed to replicate my experiment.

Pitch

When training starts, report the GPU RAM and the GPU usage together with loss and v_num

Alternatives

After the first epoch's data has been loaded onto the GPU, log the GPU RAM and the GPU usage

Additional context

Labels: enhancement, good first issue, help wanted, let's do it!

All 9 comments

Hi! Thanks for your contribution, great first issue!

Re 1) you can use the batch size finder.
Re 2) how is it different from a logger?
cc: @jeremyjordan @SkafteNicki

If your goal is just to optimize the batch size, then the batch size finder may be what you are looking for.
If we were to log the resource usage, I guess we could write a callback similar to LearningRateLogger that extracts this information (through nvidia-smi or maybe gpustat?) and logs these numbers to the logger of the trainer.
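For illustration, a minimal sketch of what such a callback could look like, assuming nvidia-smi is on the PATH and is queried through subprocess rather than gpustat; the class name, metric names, and choice of hook are made up for this example, not an existing Lightning API:

```python
import subprocess

from pytorch_lightning.callbacks import Callback


class GPUStatsLogger(Callback):
    """Hypothetical callback: logs GPU memory and utilisation queried from nvidia-smi."""

    def _query_gpu_stats(self):
        # Ask nvidia-smi for memory and utilisation in an easy-to-parse CSV format.
        result = subprocess.run(
            [
                "nvidia-smi",
                "--query-gpu=memory.used,memory.free,utilization.gpu",
                "--format=csv,nounits,noheader",
            ],
            capture_output=True, text=True, check=True,
        )
        # One line per GPU, e.g. "1234, 10000, 57"
        return [
            [float(value) for value in line.split(",")]
            for line in result.stdout.strip().splitlines()
        ]

    def on_train_batch_start(self, trainer, pl_module, *args, **kwargs):
        if trainer.logger is None:
            return
        for gpu_id, (mem_used, mem_free, util) in enumerate(self._query_gpu_stats()):
            trainer.logger.log_metrics(
                {
                    f"gpu_{gpu_id}/memory.used_MB": mem_used,
                    f"gpu_{gpu_id}/memory.free_MB": mem_free,
                    f"gpu_{gpu_id}/utilization.gpu_%": util,
                },
                step=trainer.global_step,
            )
```

It would be attached like any other callback, e.g. `Trainer(callbacks=[GPUStatsLogger()])`. Querying nvidia-smi on every batch adds a small overhead, so a real implementation would probably poll on a fixed interval or only every N batches.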

Hi @SkafteNicki @Borda,
In fact, what I am looking for is both things: first, optimize the batch size that fits in my GPU, and then keep logging the GPU usage.
Recently, I came across https://github.com/sicara/gpumonitor, which implements a PL callback.
I will test whether this gpumonitor has what I was looking for.
Thank you for your comments.

@groadabike mind sending a PR with a PL callback?

Hi @Borda, I tried to use the gpumonitor callback but it didn't work on my HPC.
For some reason, the training stops and waits for something.
I can't send a PR as I don't have any callback implemented.
I still need to know the GPU utilisation because I know I have a bottleneck in the dataloader (found with the profiler), but I don't know for how long the GPU is waiting for the next batch.
I will get back to you when I solve this issue.

Hi @Borda,
I have a first attempt at a callback to measure the GPU usage and the GPU "dead" periods.
Can you take a look at it and give me your feedback?
I am taking several measurements and logging them to TensorBoard (a sketch of the timing part follows the list):

1. Time between batches: the time between the end of one batch and the start of the next.
2. Time in batch: the time between the start and end of one batch. (screenshot attached)
3. GPU utilisation: % of GPU utilisation, measured at the beginning and end of each batch. (screenshot attached)
4. GPU memory used. (screenshot attached)
5. GPU memory free. (screenshot attached)

gpuusage_callback.zip
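Not the attached file, but a rough sketch of how the timing measurements above could be structured, assuming the standard on_train_batch_start / on_train_batch_end Callback hooks; the class and metric names are illustrative:

```python
import time

from pytorch_lightning.callbacks import Callback


class BatchTimingLogger(Callback):
    """Hypothetical callback measuring time in a batch and the gap between batches."""

    def __init__(self):
        self._batch_start = None
        self._last_batch_end = None

    def on_train_batch_start(self, trainer, pl_module, *args, **kwargs):
        now = time.monotonic()
        if self._last_batch_end is not None and trainer.logger is not None:
            # Time the GPU potentially spent waiting for the next batch.
            trainer.logger.log_metrics(
                {"time/between_batches_s": now - self._last_batch_end},
                step=trainer.global_step,
            )
        self._batch_start = now

    def on_train_batch_end(self, trainer, pl_module, *args, **kwargs):
        now = time.monotonic()
        if trainer.logger is not None:
            trainer.logger.log_metrics(
                {"time/in_batch_s": now - self._batch_start},
                step=trainer.global_step,
            )
        self._last_batch_end = now
```

If the time between batches is large compared to the time in a batch, that is the dataloader bottleneck showing up: the GPU sits idle waiting for the next batch.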

@groadabike I think that looks like a great addition. If you want to submit a PR, feel free :)
Personally, I would also add flags for temperature (the temperature.gpu and temperature.memory queries) and fans (the fan.speed query), both disabled by default. For the memory_utilization flag I would also log utilization.memory.
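If the callback builds its nvidia-smi query from a list of fields, those extra metrics are a one-line change each; a small illustrative helper (not an existing API) could look like this:

```python
# Illustrative helper: assemble the nvidia-smi --query-gpu argument from optional flags.
def build_gpu_query(temperature: bool = False, fan_speed: bool = False,
                    memory_utilization: bool = True) -> str:
    fields = ["memory.used", "memory.free", "utilization.gpu"]
    if memory_utilization:
        fields.append("utilization.memory")  # memory controller utilisation
    if temperature:
        fields += ["temperature.gpu", "temperature.memory"]
    if fan_speed:
        fields.append("fan.speed")
    return "--query-gpu=" + ",".join(fields)
```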
