PyTorch Lightning: Adding NVIDIA-SMI-like information

Created on 4 Jun 2020 · 9 comments · Source: PyTorchLightning/pytorch-lightning

🚀 Feature

  • Add GPU usage information during training.

Motivation

Most of the research is done on HPC systems. Therefore, if I want to see the GPU RAM and utilisation of my job, I have to open a secondary screen to run "watch nvidia-smi" or "nvidia-smi dmon".
Having this info saved in the logs would help to:

  1. See if I have space for larger batches
  2. Report the correct resources needed to replicate my experiment.

Pitch

When training starts, report the GPU RAM and the GPU usage together with loss and v_num

Alternatives

After the first epoch's data has been loaded onto the GPU, log the GPU RAM and the GPU usage

Additional context

Labels: enhancement, good first issue, help wanted, let's do it!

All 9 comments

Hi! Thanks for your contribution, great first issue!

Re 1) you can use the batch size finder.
Re 2) how is it different from a logger?
cc: @jeremyjordan @SkafteNicki

If your goal is just to optimize the batch size, then the batch size finder may be what you are looking for.
If we were to log the resource usage, I guess we could write a callback similar to LearningRateLogger that extracts this information (through nvidia-smi or maybe gpustat?) and logs these numbers to the logger of the trainer.
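For illustration, a minimal sketch of what such a callback could look like, assuming nvidia-smi is on the PATH and is queried through subprocess rather than gpustat; the class name, metric names, and choice of hook are made up for this example, not an existing Lightning API:

```python
import subprocess

from pytorch_lightning.callbacks import Callback


class GPUStatsLogger(Callback):
    """Hypothetical callback: logs GPU memory and utilisation queried from nvidia-smi."""

    def _query_gpu_stats(self):
        # Ask nvidia-smi for memory and utilisation in an easy-to-parse CSV format.
        result = subprocess.run(
            [
                "nvidia-smi",
                "--query-gpu=memory.used,memory.free,utilization.gpu",
                "--format=csv,nounits,noheader",
            ],
            capture_output=True, text=True, check=True,
        )
        # One line per GPU, e.g. "1234, 10000, 57"
        return [
            [float(value) for value in line.split(",")]
            for line in result.stdout.strip().splitlines()
        ]

    def on_train_batch_start(self, trainer, pl_module, *args, **kwargs):
        if trainer.logger is None:
            return
        for gpu_id, (mem_used, mem_free, util) in enumerate(self._query_gpu_stats()):
            trainer.logger.log_metrics(
                {
                    f"gpu_{gpu_id}/memory.used_MB": mem_used,
                    f"gpu_{gpu_id}/memory.free_MB": mem_free,
                    f"gpu_{gpu_id}/utilization.gpu_%": util,
                },
                step=trainer.global_step,
            )
```

It would be attached like any other callback, e.g. `Trainer(callbacks=[GPUStatsLogger()])`. Querying nvidia-smi on every batch adds a small overhead, so a real implementation would probably poll on a fixed interval or only every N batches.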

Hi @SkafteNicki @Borda,
In fact, what I am looking for is both things: first, optimize the batch size that fits in my GPU, and then keep logging the GPU usage.
Recently, I came across https://github.com/sicara/gpumonitor, which implements a PL callback.
I will test whether this gpumonitor has what I was looking for.
Thank you for your comments.

@groadabike mind sending a PR with a PL callback?

Hi @Borda, I tried to use the gpumonitor callback but it didn't work on my HPC.
For some reason, the training stops and waits for something.
I can't send a PR as I don't have any callback implemented.
I still need to know the GPU utilisation because I know I have a bottleneck in the dataloader (found with the profiler), but I don't know for how long the GPU is waiting for the next batch.
I will get back to you when I solve this issue.

Hi @Borda,
I have a first attempt at a callback to measure the GPU usage and the GPU "dead" periods.
Can you take a look at it and give me your feedback?
I am taking several measurements and logging them to TensorBoard (a sketch of the timing part follows the list):

1. Time between batches: the time between the end of one batch and the start of the next.
2. Time in batch: the time between the start and end of one batch. (screenshot attached)
3. GPU utilisation: % of GPU utilisation, measured at the beginning and end of each batch. (screenshot attached)
4. GPU memory used. (screenshot attached)
5. GPU memory free. (screenshot attached)

gpuusage_callback.zip
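Not the attached file, but a rough sketch of how the timing measurements above could be structured, assuming the standard on_train_batch_start / on_train_batch_end Callback hooks; the class and metric names are illustrative:

```python
import time

from pytorch_lightning.callbacks import Callback


class BatchTimingLogger(Callback):
    """Hypothetical callback measuring time in a batch and the gap between batches."""

    def __init__(self):
        self._batch_start = None
        self._last_batch_end = None

    def on_train_batch_start(self, trainer, pl_module, *args, **kwargs):
        now = time.monotonic()
        if self._last_batch_end is not None and trainer.logger is not None:
            # Time the GPU potentially spent waiting for the next batch.
            trainer.logger.log_metrics(
                {"time/between_batches_s": now - self._last_batch_end},
                step=trainer.global_step,
            )
        self._batch_start = now

    def on_train_batch_end(self, trainer, pl_module, *args, **kwargs):
        now = time.monotonic()
        if trainer.logger is not None:
            trainer.logger.log_metrics(
                {"time/in_batch_s": now - self._batch_start},
                step=trainer.global_step,
            )
        self._last_batch_end = now
```

If the time between batches is large compared to the time in a batch, that is the dataloader bottleneck showing up: the GPU sits idle waiting for the next batch.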

@groadabike I think that looks like a great addition. If you want to submit a PR, feel free :)
Personally, I would also add flags for temperature (the temperature.gpu and temperature.memory queries) and fans (the fan.speed query), both disabled by default. For the memory_utilization flag I would also log utilization.memory.
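If the callback builds its nvidia-smi query from a list of fields, those extra metrics are a one-line change each; a small illustrative helper (not an existing API) could look like this:

```python
# Illustrative helper: assemble the nvidia-smi --query-gpu argument from optional flags.
def build_gpu_query(temperature: bool = False, fan_speed: bool = False,
                    memory_utilization: bool = True) -> str:
    fields = ["memory.used", "memory.free", "utilization.gpu"]
    if memory_utilization:
        fields.append("utilization.memory")  # memory controller utilisation
    if temperature:
        fields += ["temperature.gpu", "temperature.memory"]
    if fan_speed:
        fields.append("fan.speed")
    return "--query-gpu=" + ",".join(fields)
```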
