I am hoping to accelerate RL training for one of my problems. Since Ape-X and IMPALA seem to be the best fit for this, I tried them on two system configurations:
(a) 4 CPUs, 1 (GTX 1070) GPU
(b) 40 CPUs, 4 (V100) GPUs - DGX station.
However, in both cases my GPU utilization is quite low, even with the examples shipped with Ray (in rllib/tuned_examples and the rl-experiments repo): it intermittently spikes to at most 8% and is 0% most of the time.
Is this expected behavior? The results presented at https://github.com/ray-project/rl-experiments/blob/master/README.md do not mention GPU utilization levels either.
Please let me know if I am doing something wrong, or what I can try to improve GPU utilization and accelerate RL training.
```
Using FIFO scheduling algorithm.
Resources requested: 4/4 CPUs, 1/1 GPUs, 0.0/8.94 GiB heap, 0.0/3.08 GiB objects
Memory usage on this node: 8.1/15.6 GiB
Result logdir: /home/ankdesh/ray_results/pong-impala-fast
Number of trials: 1 ({'RUNNING': 1})
RUNNING trials:
 - IMPALA_PongNoFrameskip-v4_0: RUNNING, [4 CPUs, 1 GPUs], [pid=15082], 10 s, 1 iter, 15000 ts, -20.6 rew
```
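For context, these are the throughput-related knobs I have been looking at in the tuned-example YAML (key names as in RLlib's IMPALA config around this release; the values below are illustrative, not the ones shipped in pong-impala-fast.yaml):

```yaml
pong-impala:
    env: PongNoFrameskip-v4
    run: IMPALA
    config:
        num_workers: 4          # CPU actor processes collecting experience
        num_envs_per_worker: 5  # vectorized envs per actor
        num_gpus: 1             # GPUs used by the central learner
        sample_batch_size: 50   # env steps per rollout fragment from each actor
        train_batch_size: 500   # samples per SGD batch on the learner
```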
nvidia-smi:
```
ankdesh@6f012718bda7:~$ nvidia-smi
Wed Oct  2 07:11:55 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    On   | 00000000:01:00.0  On |                  N/A |
|  0%   47C    P8    15W / 200W |    515MiB /  8118MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
```
@ankdesh I am facing the same problem. Running the tuned example pong-speedrun/pong-impala-fast.yaml, which is expected to solve Pong in under 7 minutes on 32 CPUs, I ran it for an hour on 10 CPUs and the reward never moved out of the negative range. My GPU usage is also under 8%.
```
Thu Oct  3 00:04:34 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.26       Driver Version: 430.26       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2060    Off  | 00000000:29:00.0  On |                  N/A |
| 10%   53C    P2    38W / 160W |   1865MiB /  5931MiB |      5%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1142      G   /usr/lib/xorg/Xorg                            18MiB |
|    0      1186      G   /usr/bin/gnome-shell                          72MiB |
|    0      1475      G   /usr/lib/xorg/Xorg                           184MiB |
|    0      1621      G   /usr/bin/gnome-shell                         135MiB |
|    0      2007      G   ...s/pycharm-community-2019.2/jbr/bin/java     4MiB |
|    0      8812      G   ...uest-channel-token=17391706427186233250   164MiB |
|    0     27564      C   /usr/bin/python3                            1281MiB |
+-----------------------------------------------------------------------------+
```
Can someone please take a look and help out? I need to decide which framework to use and would like to settle this soon.
No activity on this thread. Closing without resolution.
@ankdesh Hey, I did some digging into this and read through the original Ape-X paper. In their setup, the workers run on CPU and only the single learner network runs on the GPU. If that is how it is implemented in RLlib, that may be why we see low GPU usage. I haven't gone through the Ape-X code in RLlib, but I think that's the case.
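To make the intuition concrete, here is a toy back-of-the-envelope model (plain Python, not RLlib code) of that actor/learner split: many CPU actors generate experience, one GPU learner trains on batches, and if the actors can't feed batches fast enough the learner sits idle. All throughput numbers below are assumed for illustration only, not measured on the hardware in this thread.

```python
def learner_utilization(num_actors, samples_per_actor_per_sec,
                        batch_size, learner_batches_per_sec):
    """Fraction of time the GPU learner is busy, assuming it trains
    only when the CPU actors have produced a full batch of samples."""
    produced = num_actors * samples_per_actor_per_sec   # total samples/s from actors
    batches_available = produced / batch_size           # batches/s the actors can feed
    return min(1.0, batches_available / learner_batches_per_sec)

# Hypothetical numbers: 4 CPU actors at 1500 env steps/s each, batches of 500,
# and a learner that could consume 50 batches/s if fully fed:
print(learner_utilization(4, 1500, 500, 50))   # -> 0.24 (GPU mostly idle)
print(learner_utilization(32, 1500, 500, 50))  # -> 1.0  (enough actors to saturate it)
```

Under these assumed rates, sampling is the bottleneck at 4 CPUs, which would match the near-zero GPU utilization both of you are seeing; adding actors (or larger batches) is what moves the needle.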
@ankdesh Also, for the examples you ran on your systems (Pong): how long did it take to reach the target scores on your two setups, and how much RAM was required? I would like to get an idea of the requirements.