Hi,
I think it would be nice to have a PyTorch version of the DQN family of algorithms (particularly the distributed ones). As far as I am aware, there is no distributed implementation of DQN algorithms (e.g., Ape-X) in PyTorch out there, so including them would be tremendously useful!
One approach may be to port an existing PyTorch DQN implementation to the torch policy graph abstraction (assuming compatible licensing). Any idea on possible reference impls here?
I think @Kaixhin's implementation of DQN style algorithms is pretty comprehensive: https://github.com/Kaixhin/Rainbow.
There's an official PyTorch tutorial with a minimal DQN implementation as well: https://github.com/pytorch/tutorials/blob/5fff87419e157bbc3fd73cfac1f6e2e0477470e8/intermediate_source/reinforcement_q_learning.py
Ah, I did try porting Kai's Rainbow a while ago. This was back before we had proper policy abstractions: https://github.com/Kaixhin/Rainbow/compare/master...ericl:rllib-example
I also recall it didn't reach the same performance for some reason, likely due to a bug introduced during the port. The code has also changed a lot since then, so it probably makes sense to start fresh.
I personally wouldn't be able to get to this soon, but if you have time to pick this up, I think the way to go would be to move the PyTorch Rainbow code into a subclass of PolicyGraph (https://github.com/ray-project/ray/blob/master/python/ray/rllib/evaluation/policy_graph.py) and implement compute_actions(), learn_on_batch(), and update_target(). That would be enough to plug into the basic DQN agent in RLlib, and then the only missing piece would be td_error handling to run in the Ape-X optimizer.
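To make that concrete, here is a rough sketch of the shape such a subclass could take. This is not working code: the PolicyGraph signatures are simplified, and RainbowModel / _rainbow_loss are placeholders for the ported Rainbow network and loss, not existing classes.

import torch

from ray.rllib.evaluation.policy_graph import PolicyGraph


class RainbowTorchPolicyGraph(PolicyGraph):
    """Sketch: wraps a ported Rainbow torch module behind the PolicyGraph API."""

    def __init__(self, observation_space, action_space, config):
        PolicyGraph.__init__(self, observation_space, action_space, config)
        # RainbowModel is a placeholder for the ported torch.nn.Module.
        self.model = RainbowModel(observation_space, action_space, config)
        self.target_model = RainbowModel(observation_space, action_space, config)
        self.optimizer = torch.optim.Adam(self.model.parameters(), lr=config["lr"])

    def compute_actions(self, obs_batch, state_batches=None, **kwargs):
        # Greedy action selection over the (expected) Q-values.
        with torch.no_grad():
            q_values = self.model(torch.as_tensor(obs_batch, dtype=torch.float32))
        return q_values.argmax(dim=1).numpy(), [], {}

    def learn_on_batch(self, samples):
        loss = self._rainbow_loss(samples)  # placeholder for the ported loss
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return {"loss": loss.item()}

    def update_target(self):
        # Hard sync of the target network, as the DQN agent expects.
        self.target_model.load_state_dict(self.model.state_dict())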
Hi, I'll have a look as well if this helps
@szymonWojdat That'd be great! Feel free to open a WIP PR as soon as possible, and tag us if you have questions about the codebase/implementation.
@ericl @szymonWojdat @richardliaw In my opinion, every algorithm should support both PyTorch and TensorFlow (wherever reasonably possible).
A lot of the DQN code will have to be refactored to fit the TF 2.0 API anyway (and the TF 1.0 object-oriented/eager API). Given that there is clearly interest in PyTorch models for the DQN algorithms, we might be better off doing this right and designing a set of APIs that completely separates the RL algorithm from the learning framework. I don't think it would be too difficult or clunky to achieve, but it will require a carefully thought-out design if we hope to accommodate future algorithms.
I think a good first step would be to gather all the different uses of models and policies, and any other classes that use TensorFlow- or PyTorch-specific API calls. Once we have an idea of what type of operations we need to support, we should be able to figure out the best way to split up and isolate the TF/PyTorch-specific calls in a way that feels natural. What do you think?
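Purely as an illustration of the split I have in mind (the class and method names here are made up, not existing RLlib APIs), the algorithm code would only ever talk to a small framework-agnostic surface:

class BackendPolicy:
    """Hypothetical framework-agnostic surface a DQN-style algorithm would call."""

    def forward(self, obs_batch):
        """Return Q-values (or logits) for a batch of observations."""
        raise NotImplementedError

    def loss_and_update(self, sample_batch):
        """Compute the algorithm's loss on a batch and apply one optimizer step."""
        raise NotImplementedError

    def get_weights(self):
        raise NotImplementedError

    def set_weights(self, weights):
        raise NotImplementedError

# A TorchBackendPolicy / TFBackendPolicy pair would implement this, while the
# DQN/Rainbow logic (n-step targets, distributional loss, ...) stays framework-free.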
@gehring I think we already have pretty good isolation here, is that not the case? For example, PG and A2C/A3C work in both PyTorch / TF without much effort.
The only cases I can think of where we have tight coupling with TF is for the multi-GPU optimizer, which is hard to avoid since it's a performance-critical component.
Sorry for the lack of update/commits recently, got stuck on something IRL, should be able to commit to this next week.
I saw some commits... looks like a good start, but @szymonWojdat were you planning on porting Kaixhin's Rainbow or writing your own? I think it makes sense to reuse code as much as possible.
Thanks! I haven't tried porting it yet, I'll have a look. Been mostly looking around the project and trying to implement some abstract methods of PolicyGraph so far.
Just curious: why aren't all abstract methods of PolicyGraph implemented in most inheriting classes, e.g. QMixPolicyGraph? I'm asking because I've been wondering which abstract methods are safe to skip. Is there any way to find out other than running some tests?
The only crucial methods are these: {compute_actions, learn_on_batch, get_weights, set_weights}. Intuitively this is because RLlib needs to know how to compute actions to run env rollouts, improve the policy once a batch of rollouts is done, and synchronize weights in the distributed setting. The other methods are sometimes needed depending on your algo but I don't think they are critical for DQN.
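For a torch policy graph, the weight-sync pair typically just maps onto the module's state_dict. Sketch only, assuming a self.model attribute as in the skeleton above:

def get_weights(self):
    # Move tensors to CPU so the weights can be shipped between workers.
    return {k: v.cpu() for k, v in self.model.state_dict().items()}

def set_weights(self, weights):
    self.model.load_state_dict(weights)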
Just to make sure I understand the whole process of porting Kaixhin's Rainbow:
- Add an RLlib dependency to Rainbow so that it uses RLlib's Evaluators and Trainers
- Add a few "run" scripts to RLlib that would use the stuff that just got integrated into Rainbow
Is this correct? I think I might be missing something about what exactly we add to RLlib (assuming it's more than just run examples).
I meant adding it to RLlib: you would implement a RainbowTorchPolicyGraph that wraps the Rainbow code, so that it can run distributed with Ape-X. This can be added to the agents/dqn directory, similar to the PyTorch support for A2C.
https://ray.readthedocs.io/en/latest/rllib-concepts.html has an overview of the high-level algorithm organization in RLlib; you'd only need to implement the policy graph component, since the rest is already there for DQN.
Thanks for the tips. I've been looking for a way to include Kaixhin's Rainbow as a dependency; any advice on that? I guess RLlib must already have some dependencies that aren't installable via pip, so an example should be enough. I thought you'd normally put those in python/ray.egg-info/dependency_links.txt.
For the purposes of rllib I think the best solution is to do a port, which would mean conforming to the expectations of the API and using my code (and results) as reference.
Thanks! Will do
Hi, I noticed that in the Rainbow implementation, Agent.learn() (which I assume corresponds to our RainbowTorchPolicyGraph.learn_on_batch()) uses the ReplayMemory class. Should I be porting that as well, or is there an equivalent class in RLlib? I assume there must be.
RLlib will take care of replay -- the input to learn_on_batch will already be the batch sampled from the replay buffer. So no need to worry about it when defining the policy graph.
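In other words (continuing the earlier sketch), learn_on_batch just consumes the batch it is handed. The field names below follow the usual SampleBatch keys, and _td_error is a placeholder for the ported Rainbow TD computation:

def learn_on_batch(self, samples):
    obs = torch.as_tensor(samples["obs"], dtype=torch.float32)
    actions = torch.as_tensor(samples["actions"], dtype=torch.long)
    rewards = torch.as_tensor(samples["rewards"], dtype=torch.float32)
    new_obs = torch.as_tensor(samples["new_obs"], dtype=torch.float32)
    dones = torch.as_tensor(samples["dones"], dtype=torch.float32)

    # Placeholder: the ported Rainbow loss / TD-error computation goes here.
    td_error = self._td_error(obs, actions, rewards, new_obs, dones)
    loss = (td_error ** 2).mean()
    self.optimizer.zero_grad()
    loss.backward()
    self.optimizer.step()
    # Ape-X needs the per-sample TD error back to update replay priorities.
    return {"td_error": td_error.detach().cpu().numpy()}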
I took @szymonWojdat's branch and tried porting this the other day, there's an initial implementation at https://github.com/ankeshanand/ray/blob/master/python/ray/rllib/agents/dqn/rainbow_torch_policy_graph.py.
Somehow, it was pretty slow (I was trying to run an Ape-X agent). @ericl Is there any Ape-X-specific stuff I should be aware of?
The config I was using was:
config = merge_dicts(
    apex.APEX_DEFAULT_CONFIG,
    {
        "num_workers": 8,
        "use_pytorch": True,
        "num_atoms": 51,
    }
)
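(Side note, hedged since the exact API moved around between Ray versions: assuming the config built above and a placeholder env name, one way to launch it from Python would be something like the following.)

import ray
from ray import tune

ray.init()
# "PongNoFrameskip-v4" is just a placeholder env; merge it into the config above.
tune.run("APEX", config=dict(config, env="PongNoFrameskip-v4"))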
Hm, could you post an example of the training result? I wonder if the GPU isn't getting used for PyTorch.
Otherwise, I think that config you posted looks reasonable.
Here's what I see on a V100 and 4 workers:
rllib train --env=Pong-v0 --run=APEX --config='{"use_pytorch": true, "num_workers": 4, "num_atoms": 51, "optimizer": {"debug": true}, "min_iter_time_s": 5, "timesteps_per_iteration": 5000, "learning_starts": 0}'
timing_breakdown:
get_samples_time_ms: 1.076
learner_dequeue_time_ms: 0.015
learner_grad_time_ms: 222.353
put_weights_time_ms: 30.778
replay_processing_time_ms: 0.701
sample_processing_time_ms: 0.636
sample_time_ms: 1.374
train_time_ms: 1.374
update_priorities_time_ms: 0.008
train_throughput: 1862.537
The GPU utilization is about 10%.
For single-threaded DQN execution, learning on a batch is slightly faster (178ms vs 222ms) -- not sure why.
rllib train --env=Pong-v0 --run=DQN --config='{"use_pytorch": true, "num_workers": 0, "num_atoms": 51, "learning_starts": 0, "num_gpus": 1, "sample_batch_size": 32, "train_batch_size": 512}'
info:
grad_time_ms: 178.668
learner:
default_policy: {}
max_exploration: 0.598592
min_exploration: 0.598592
num_steps_sampled: 5120
num_steps_trained: 81920
num_target_updates: 10
opt_peak_throughput: 2865.658
opt_samples: 512.0
replay_time_ms: 94.411
sample_time_ms: 91.891
update_time_ms: 0.002
Strange, so one training iteration of 5000 steps takes about 200 seconds for me, which seems excessive. I am running on a P100 machine with 4 workers (same config as yours, have verified that the GPU is being used, and the machine has 12 physical cores)
replay_time_ms: 1933.884
update_priorities_time_ms: 456.061
sample_throughput: 859.499
timing_breakdown:
get_samples_time_ms: 0.564
learner_dequeue_time_ms: 0.007
learner_grad_time_ms: 1438.682
put_weights_time_ms: 51.351
replay_processing_time_ms: 5.326
sample_processing_time_ms: 0.477
sample_time_ms: 5.817
train_time_ms: 5.817
update_priorities_time_ms: 0.003
train_throughput: 0.0
iterations_since_restore: 3
num_healthy_workers: 4
num_metric_batches_dropped: 0
off_policy_estimator: {}
pid: 22809
policy_reward_mean: {}
sampler_perf:
mean_env_wait_ms: 19.079955082821105
mean_inference_ms: 101.28750366182197
mean_processing_ms: 41.70060372614461
time_since_restore: 1019.4289181232452
time_this_iter_s: 215.07506322860718
time_total_s: 1019.4289181232452
timestamp: 1556611893
timesteps_since_restore: 15000
timesteps_this_iter: 5000
timesteps_total: 15000
training_iteration: 3
Oh hm, your inference time is 10x what I see: mean_inference_ms: 101.2
Also, the grad time is 7x slower (1400ms).
How fast does Kai's rainbow run for you?
I will have to check, but I don't think the machine is the issue: I got similar performance on a V100 machine. I am running a development install right now (https://ray.readthedocs.io/en/latest/rllib-dev.html), could that be the issue?
@ericl Could there be other sources of performance bottleneck? I built ray from scratch, and looked through https://ray.readthedocs.io/en/latest/troubleshooting.html.
Not that I know of. One question is whether inference and backprop run faster outside of Ray or not (i.e., is it an environment issue or a Ray-related issue?).
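One quick way to check that (a standalone sketch, pure PyTorch with a made-up Atari-sized network, no Ray involved) is to time inference and a backward pass directly on the GPU:

import time
import torch
import torch.nn as nn

# Stand-in conv net roughly the size of an Atari DQN torso (84x84x4 input).
model = nn.Sequential(
    nn.Conv2d(4, 32, 8, stride=4), nn.ReLU(),
    nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
    nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
    nn.Linear(512, 6),
).cuda()
obs = torch.randn(512, 4, 84, 84, device="cuda")

def timed_ms(fn, iters=50):
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.time() - start) / iters * 1000

with torch.no_grad():
    print("inference ms per batch:", timed_ms(lambda: model(obs)))

def train_step():
    model.zero_grad()
    loss = model(obs).pow(2).mean()
    loss.backward()

print("forward+backward ms per batch:", timed_ms(train_step))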
Any progress with that? Can someone share existing code? I'd gladly help here.
@ericl, I saw you ran:
rllib train --env=Pong-v0 --run=DQN --config='{"use_pytorch": true, ...
But I couldn't find any DQN implementation for PyTorch in 0.7.2. Is it a new thing in 0.8?
I'll take this one. We agreed on unifying things in the Policy realm a little, so that Agents no longer need to care about which backend is used, and thus avoid "almost-duplicate" code. I've done some preliminary work on PG and it looks OK. Will do DQN next.
Working on this now. Preliminary tests look good. Expect this to be fully functional within a week to 10 days.
NOTE: This will include the parameter-noise exploration option, but may not include the distributional head and noisy layers (those will be added after the aforementioned initial version).
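For the noisy-layers piece, if/when it is added, the standard approach is the factorised-Gaussian NoisyNet layer from the Rainbow paper. A generic sketch (not necessarily what the RLlib version will look like):

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Factorised-Gaussian noisy linear layer (Fortunato et al., 2017)."""

    def __init__(self, in_features, out_features, sigma0=0.5):
        super().__init__()
        self.in_features, self.out_features = in_features, out_features
        self.weight_mu = nn.Parameter(torch.empty(out_features, in_features))
        self.weight_sigma = nn.Parameter(torch.empty(out_features, in_features))
        self.bias_mu = nn.Parameter(torch.empty(out_features))
        self.bias_sigma = nn.Parameter(torch.empty(out_features))
        self.register_buffer("weight_eps", torch.zeros(out_features, in_features))
        self.register_buffer("bias_eps", torch.zeros(out_features))
        bound = 1.0 / math.sqrt(in_features)
        self.weight_mu.data.uniform_(-bound, bound)
        self.bias_mu.data.uniform_(-bound, bound)
        self.weight_sigma.data.fill_(sigma0 / math.sqrt(in_features))
        self.bias_sigma.data.fill_(sigma0 / math.sqrt(in_features))
        self.reset_noise()

    @staticmethod
    def _scale_noise(size, device):
        # Factorised noise transform: sign(x) * sqrt(|x|).
        x = torch.randn(size, device=device)
        return x.sign() * x.abs().sqrt()

    def reset_noise(self):
        eps_in = self._scale_noise(self.in_features, self.weight_mu.device)
        eps_out = self._scale_noise(self.out_features, self.weight_mu.device)
        self.weight_eps.copy_(eps_out.unsqueeze(1) * eps_in.unsqueeze(0))
        self.bias_eps.copy_(eps_out)

    def forward(self, x):
        if self.training:
            weight = self.weight_mu + self.weight_sigma * self.weight_eps
            bias = self.bias_mu + self.bias_sigma * self.bias_eps
        else:
            weight, bias = self.weight_mu, self.bias_mu
        return F.linear(x, weight, bias)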
Sorry, a little late, but here we go.
https://github.com/ray-project/ray/pull/7597
I'm closing this issue now.