Ray: [RLlib] Low GPU utilization with Apex and IMPALA

Created on 2 Oct 2019 · 5 comments · Source: ray-project/ray

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04
  • Ray installed from (source or binary): Binary (pip install ray[rllib] )
  • Ray version: 0.7.5
  • Python version: 3.5.2
  • Exact command to reproduce:
    ___rl-experiments___
    $ rllib train -f atari-apex/atari-apex.yaml
    $ rllib train -f pong-speedrun/pong-impala-fast.yaml
    ___rllib/tuned_examples___
    $ rllib train -f tuned_examples/atari-apex.yaml

Describe the problem


I am hoping to accelerate RL training for one of my problems. Since Ape-X and IMPALA seem to be the best fit for this, I tried them with two system configurations:
(a) 4 CPUs, 1 GPU (GTX 1070)
(b) 40 CPUs, 4 GPUs (V100) - DGX Station
However, in both cases GPU utilization is quite low, even with the examples shipped with Ray (in rllib/tuned_examples and the rl-experiments repo). It only intermittently reaches up to 8% and is 0% most of the time.

Is this expected behavior? The results presented at https://github.com/ray-project/rl-experiments/blob/master/README.md do not mention GPU utilization levels either.

Kindly let me know if I am doing something wrong, or if there is anything I can try to improve GPU utilization and accelerate RL training.
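
For reference, this is roughly how I launch the same IMPALA experiment from Python when experimenting with settings that might push more work onto the GPU learner (a minimal sketch; the num_envs_per_worker and batch-size values below are illustrative guesses, not the tuned-example settings):

```python
import ray
from ray import tune

ray.init()

# Roughly equivalent to pong-speedrun/pong-impala-fast.yaml, with the knobs that
# control how much data reaches the single GPU learner spelled out explicitly.
# The specific values are illustrative, not the tuned settings.
tune.run(
    "IMPALA",
    stop={"episode_reward_mean": 18},
    config={
        "env": "PongNoFrameskip-v4",
        "num_gpus": 1,              # one learner process on the GPU
        "num_workers": 3,           # CPU rollout workers (fits a 4-CPU box)
        "num_envs_per_worker": 5,   # more envs per worker -> higher sample throughput
        "sample_batch_size": 50,    # steps collected per rollout before shipping to the learner
        "train_batch_size": 500,    # larger learner batches keep the GPU busier
    },
)
```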

Source code / logs

```
Using FIFO scheduling algorithm.
Resources requested: 4/4 CPUs, 1/1 GPUs, 0.0/8.94 GiB heap, 0.0/3.08 GiB objects
Memory usage on this node: 8.1/15.6 GiB
Result logdir: /home/ankdesh/ray_results/pong-impala-fast
Number of trials: 1 ({'RUNNING': 1})
RUNNING trials:
 - IMPALA_PongNoFrameskip-v4_0: RUNNING, [4 CPUs, 1 GPUs], [pid=15082], 10 s, 1 iter, 15000 ts, -20.6 rew
```

nvidia-smi
```
ankdesh@6f012718bda7:~$ nvidia-smi
Wed Oct 2 07:11:55 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    On   | 00000000:01:00.0  On |                  N/A |
|  0%   47C    P8    15W / 200W |    515MiB /  8118MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
```

All 5 comments

@ankdesh I am facing the same problem. I am also running the tuned example pong-speedrun/pong-impala-fast.yaml, which should finish in under 7 minutes on 32 CPUs. I ran it for an hour on 10 CPUs and the reward never moved out of negative territory. My GPU usage is also under 8%.

```
Thu Oct  3 00:04:34 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.26       Driver Version: 430.26       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2060    Off  | 00000000:29:00.0  On |                  N/A |
| 10%   53C    P2    38W / 160W |   1865MiB /  5931MiB |      5%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1142      G   /usr/lib/xorg/Xorg                            18MiB |
|    0      1186      G   /usr/bin/gnome-shell                          72MiB |
|    0      1475      G   /usr/lib/xorg/Xorg                           184MiB |
|    0      1621      G   /usr/bin/gnome-shell                         135MiB |
|    0      2007      G   ...s/pycharm-community-2019.2/jbr/bin/java     4MiB |
|    0      8812      G   ...uest-channel-token=17391706427186233250   164MiB |
|    0     27564      C   /usr/bin/python3                            1281MiB |
+-----------------------------------------------------------------------------+
```

Can someone please take a look and help out? I need to decide which framework to use, and I need to make that decision soon.
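
One possible factor here (just a guess, not something confirmed in this thread): the 7-minute speedrun result assumes 32 rollout workers, and on a 10-CPU machine most of that parallelism simply is not there, so the learner is starved for samples as well as under-using the GPU. A minimal sketch of scaling the same run down to the CPUs actually available:

```python
import multiprocessing

import ray
from ray import tune

ray.init()

# Match rollout parallelism to the machine instead of the 32 workers the
# speedrun assumes; leaving two cores for the learner and driver is a guess.
num_workers = max(1, multiprocessing.cpu_count() - 2)

tune.run(
    "IMPALA",
    stop={"episode_reward_mean": 18},
    config={
        "env": "PongNoFrameskip-v4",
        "num_gpus": 1,
        "num_workers": num_workers,
    },
)
```

With fewer workers the wall-clock time to solve Pong will of course be much longer than the published 7 minutes.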

No activity on this thread. Closing without resolution.

@ankdesh Hey, I did some digging into this. I read through the original Ape-X paper, and in their setup the rollout workers run on CPU while only the single learner network runs on the GPU. If that is also how it is implemented in RLlib, that may be why we see low GPU usage. I haven't gone through the Ape-X code in RLlib, but I think that's the case.
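
If that is how RLlib structures it, the GPU only sees the learner's SGD updates while all sampling stays on CPU actors, so a low average utilization in nvidia-smi would be expected. A minimal sketch of what that resource split looks like when launching Ape-X from Python (the worker count and batch size below are illustrative, not the tuned-example values):

```python
import ray
from ray import tune

ray.init()

# Ape-X style split: many CPU-only actors generate and replay experience, and a
# single learner consumes batches on the GPU. Only those learner updates show
# up as GPU utilization. Values are illustrative.
tune.run(
    "APEX",
    config={
        "env": "PongNoFrameskip-v4",
        "num_gpus": 1,       # the one learner process that uses the GPU
        "num_workers": 8,    # CPU-only rollout actors
        "train_batch_size": 512,
    },
)
```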

@ankdesh I also wanted to ask how long it took to reach the desired scores in your two setups, and how much RAM was required, for the examples you ran on your system (Pong). I just want to get a rough idea.
