Hi,
I think it would be nice to have a PyTorch version of the DQN family of algorithms (particularly the distributed ones). As far as I am aware, there is no distributed implementation of DQN algorithms (e.g., Ape-X) in PyTorch out there, so including them would be tremendously useful!
One approach may be to port an existing PyTorch DQN implementation to the torch policy graph abstraction (assuming compatible licensing). Any idea on possible reference impls here?
I think @Kaixhin's implementation of DQN style algorithms is pretty comprehensive: https://github.com/Kaixhin/Rainbow.
There's an official PyTorch tutorial with a minimal DQN implementation as well: https://github.com/pytorch/tutorials/blob/5fff87419e157bbc3fd73cfac1f6e2e0477470e8/intermediate_source/reinforcement_q_learning.py
Ah, I did try porting Kai's Rainbow a while ago. This was back before we had proper policy abstractions: https://github.com/Kaixhin/Rainbow/compare/master...ericl:rllib-example
I also recall it didn't reach the same performance for some reason, likely due to a bug introduced during the port. The code has also changed a lot since then, so it probably makes sense to start fresh.
I personally wouldn't be able to get to this soon, but if you have time to pick this up, I think the way to go would be to move the PyTorch Rainbow code into a subclass of PolicyGraph (https://github.com/ray-project/ray/blob/master/python/ray/rllib/evaluation/policy_graph.py) and implement compute_actions(), learn_on_batch(), and update_target(). That would be enough to plug into the basic DQN agent in RLlib, and then the only missing piece would be td_error handling to run in the Ape-X optimizer.
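To make that concrete, here is a rough sketch of the shape such a subclass could take. This is not working code: the PolicyGraph signatures are simplified, and RainbowModel / _rainbow_loss are placeholders for the ported Rainbow network and loss, not existing classes.

import torch

from ray.rllib.evaluation.policy_graph import PolicyGraph


class RainbowTorchPolicyGraph(PolicyGraph):
    """Sketch: wraps a ported Rainbow torch module behind the PolicyGraph API."""

    def __init__(self, observation_space, action_space, config):
        PolicyGraph.__init__(self, observation_space, action_space, config)
        # RainbowModel is a placeholder for the ported torch.nn.Module.
        self.model = RainbowModel(observation_space, action_space, config)
        self.target_model = RainbowModel(observation_space, action_space, config)
        self.optimizer = torch.optim.Adam(self.model.parameters(), lr=config["lr"])

    def compute_actions(self, obs_batch, state_batches=None, **kwargs):
        # Greedy action selection over the (expected) Q-values.
        with torch.no_grad():
            q_values = self.model(torch.as_tensor(obs_batch, dtype=torch.float32))
        return q_values.argmax(dim=1).numpy(), [], {}

    def learn_on_batch(self, samples):
        loss = self._rainbow_loss(samples)  # placeholder for the ported loss
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return {"loss": loss.item()}

    def update_target(self):
        # Hard sync of the target network, as the DQN agent expects.
        self.target_model.load_state_dict(self.model.state_dict())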
Hi, I'll have a look as well if this helps
@szymonWojdat That'd be great! Feel free to open a WIP PR as soon as possible, and tag us if you have questions about the codebase/implementation.
@ericl @szymonWojdat @richardliaw In my opinion, every algorithm should support both PyTorch and TensorFlow (wherever reasonably possible).
A lot of the DQN code will have to be refactored to fit the TF 2.0 API anyway (and the TF 1.0 object-oriented/eager API). Given that there is clearly interest in PyTorch models for the DQN algorithms, we might be better off doing this right and designing a set of APIs that completely separates the RL algorithm from the learning framework. I don't think it would be too difficult or clunky to achieve, but it will require a carefully thought-out design if we hope to accommodate future algorithms.
I think a good first step would be to gather all the different uses of models and policies, and any other classes that use TensorFlow- or PyTorch-specific API calls. Once we have an idea of what type of operations we need to support, we should be able to figure out the best way to split up and isolate the TF/PyTorch-specific calls in a way that feels natural. What do you think?
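Purely as an illustration of the split I have in mind (the class and method names here are made up, not existing RLlib APIs), the algorithm code would only ever talk to a small framework-agnostic surface:

class BackendPolicy:
    """Hypothetical framework-agnostic surface a DQN-style algorithm would call."""

    def forward(self, obs_batch):
        """Return Q-values (or logits) for a batch of observations."""
        raise NotImplementedError

    def loss_and_update(self, sample_batch):
        """Compute the algorithm's loss on a batch and apply one optimizer step."""
        raise NotImplementedError

    def get_weights(self):
        raise NotImplementedError

    def set_weights(self, weights):
        raise NotImplementedError

# A TorchBackendPolicy / TFBackendPolicy pair would implement this, while the
# DQN/Rainbow logic (n-step targets, distributional loss, ...) stays framework-free.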
@gehring I think we already have pretty good isolation here, is that not the case? For example, PG and A2C/A3C work in both PyTorch / TF without much effort.
The only cases I can think of where we have tight coupling with TF is for the multi-GPU optimizer, which is hard to avoid since it's a performance-critical component.
Sorry for the lack of update/commits recently, got stuck on something IRL, should be able to commit to this next week.
I saw some commits... looks like a good start, but @szymonWojdat were you planning on porting Kaixhin's Rainbow or writing your own? I think it makes sense to reuse code as much as possible.
Thanks! I haven't tried porting it yet, I'll have a look. Been mostly looking around the project and trying to implement some abstract methods of PolicyGraph so far.
Just curious: why aren't all abstract methods of PolicyGraph implemented in most inheriting classes, e.g. QMixPolicyGraph? I'm asking because I've been wondering which abstract methods are safe to skip. Is there any way to find out other than running some tests?
The only crucial methods are these: {compute_actions, learn_on_batch, get_weights, set_weights}. Intuitively this is because RLlib needs to know how to compute actions to run env rollouts, improve the policy once a batch of rollouts is done, and synchronize weights in the distributed setting. The other methods are sometimes needed depending on your algo but I don't think they are critical for DQN.
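For a torch policy graph, the weight-sync pair typically just maps onto the module's state_dict. Sketch only, assuming a self.model attribute as in the skeleton above:

def get_weights(self):
    # Move tensors to CPU so the weights can be shipped between workers.
    return {k: v.cpu() for k, v in self.model.state_dict().items()}

def set_weights(self, weights):
    self.model.load_state_dict(weights)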
Just to make sure I understand the whole process of porting Kaixhin's Rainbow:
- Add an RLlib dependency to Rainbow so that it uses RLlib's Evaluators and Trainers
- Add a few "run" scripts to RLlib that would use the stuff that just got integrated into Rainbow
Is this correct? I think I might be missing something about what exactly we add to RLlib (assuming it's more than just run examples).
I meant adding it to RLlib: you would implement a RainbowTorchPolicyGraph that wraps the Rainbow code, so that it can run distributed with Ape-X. This can be added to the agents/dqn directory, similar to the PyTorch support for A2C.
https://ray.readthedocs.io/en/latest/rllib-concepts.html has an overview of the high-level algorithm organization in RLlib; you'd only need to implement the policy graph component, since the rest is already there for DQN.
Thanks for the tips. I've been looking for a way to include Kaixhin's Rainbow as a dependency; any advice on that? I guess RLlib must already have some dependencies that aren't installable via pip, so an example should be enough. I thought you'd normally put those in python/ray.egg-info/dependency_links.txt.
For the purposes of rllib I think the best solution is to do a port, which would mean conforming to the expectations of the API and using my code (and results) as reference.
Thanks! Will do
Hi, I noticed that in the Rainbow implementation, Agent.learn() (which I assume corresponds to our RainbowTorchPolicyGraph.learn_on_batch()) uses the ReplayMemory class. Should I be porting that as well, or is there an equivalent class in RLlib? I assume there must be.
RLlib will take care of replay -- the input to learn_on_batch will already be the batch sampled from the replay buffer. So no need to worry about it when defining the policy graph.
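In other words (continuing the earlier sketch), learn_on_batch just consumes the batch it is handed. The field names below follow the usual SampleBatch keys, and _td_error is a placeholder for the ported Rainbow TD computation:

def learn_on_batch(self, samples):
    obs = torch.as_tensor(samples["obs"], dtype=torch.float32)
    actions = torch.as_tensor(samples["actions"], dtype=torch.long)
    rewards = torch.as_tensor(samples["rewards"], dtype=torch.float32)
    new_obs = torch.as_tensor(samples["new_obs"], dtype=torch.float32)
    dones = torch.as_tensor(samples["dones"], dtype=torch.float32)

    # Placeholder: the ported Rainbow loss / TD-error computation goes here.
    td_error = self._td_error(obs, actions, rewards, new_obs, dones)
    loss = (td_error ** 2).mean()
    self.optimizer.zero_grad()
    loss.backward()
    self.optimizer.step()
    # Ape-X needs the per-sample TD error back to update replay priorities.
    return {"td_error": td_error.detach().cpu().numpy()}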
I took @szymonWojdat's branch and tried porting this the other day, there's an initial implementation at https://github.com/ankeshanand/ray/blob/master/python/ray/rllib/agents/dqn/rainbow_torch_policy_graph.py.
Somehow, it was pretty slow (I was trying to run an Ape-X agent). @ericl Is there any Ape-X-specific stuff I should be aware of?
The config I was using was:
config = merge_dicts(
    apex.APEX_DEFAULT_CONFIG,
    {
        "num_workers": 8,
        "use_pytorch": True,
        "num_atoms": 51,
    }
)
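(Side note, hedged since the exact API moved around between Ray versions: assuming the config built above and a placeholder env name, one way to launch it from Python would be something like the following.)

import ray
from ray import tune

ray.init()
# "PongNoFrameskip-v4" is just a placeholder env; merge it into the config above.
tune.run("APEX", config=dict(config, env="PongNoFrameskip-v4"))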
Hm, could you post an example of the training result? I wonder if the GPU isn't getting used for PyTorch.
Otherwise, I think that config you posted looks reasonable.
Here's what I see on a V100 and 4 workers:
rllib train --env=Pong-v0 --run=APEX --config='{"use_pytorch": true, "num_workers": 4, "num_atoms": 51, "optimizer": {"debug": true}, "min_iter_time_s": 5, "timesteps_per_iteration": 5000, "learning_starts": 0}'
timing_breakdown:
get_samples_time_ms: 1.076
learner_dequeue_time_ms: 0.015
learner_grad_time_ms: 222.353
put_weights_time_ms: 30.778
replay_processing_time_ms: 0.701
sample_processing_time_ms: 0.636
sample_time_ms: 1.374
train_time_ms: 1.374
update_priorities_time_ms: 0.008
train_throughput: 1862.537
The GPU utilization is about 10%.
For single-threaded DQN execution, learning on a batch is slightly faster (178ms vs 222ms) -- not sure why.
rllib train --env=Pong-v0 --run=DQN --config='{"use_pytorch": true, "num_workers": 0, "num_atoms": 51, "learning_starts": 0, "num_gpus": 1, "sample_batch_size": 32, "train_batch_size": 512}'
info:
grad_time_ms: 178.668
learner:
default_policy: {}
max_exploration: 0.598592
min_exploration: 0.598592
num_steps_sampled: 5120
num_steps_trained: 81920
num_target_updates: 10
opt_peak_throughput: 2865.658
opt_samples: 512.0
replay_time_ms: 94.411
sample_time_ms: 91.891
update_time_ms: 0.002
Strange, so one training iteration of 5000 steps takes about 200 seconds for me, which seems excessive. I am running on a P100 machine with 4 workers (same config as yours, have verified that the GPU is being used, and the machine has 12 physical cores)
replay_time_ms: 1933.884
update_priorities_time_ms: 456.061
sample_throughput: 859.499
timing_breakdown:
get_samples_time_ms: 0.564
learner_dequeue_time_ms: 0.007
learner_grad_time_ms: 1438.682
put_weights_time_ms: 51.351
replay_processing_time_ms: 5.326
sample_processing_time_ms: 0.477
sample_time_ms: 5.817
train_time_ms: 5.817
update_priorities_time_ms: 0.003
train_throughput: 0.0
iterations_since_restore: 3
num_healthy_workers: 4
num_metric_batches_dropped: 0
off_policy_estimator: {}
pid: 22809
policy_reward_mean: {}
sampler_perf:
mean_env_wait_ms: 19.079955082821105
mean_inference_ms: 101.28750366182197
mean_processing_ms: 41.70060372614461
time_since_restore: 1019.4289181232452
time_this_iter_s: 215.07506322860718
time_total_s: 1019.4289181232452
timestamp: 1556611893
timesteps_since_restore: 15000
timesteps_this_iter: 5000
timesteps_total: 15000
training_iteration: 3
Oh hm, your inference time is 10x what I see: mean_inference_ms: 101.2
Also, the grad time is 7x slower (1400ms).
How fast does Kai's rainbow run for you?
I will have to check, but I don't think the machine is the issue: I got similar performance on a V100 machine. I am running a development install right now (https://ray.readthedocs.io/en/latest/rllib-dev.html), could that be the issue?
@ericl Could there be other sources of performance bottleneck? I built ray from scratch, and looked through https://ray.readthedocs.io/en/latest/troubleshooting.html.
Not that I know of. One question is whether inference and backprop run faster outside of Ray or not (i.e., is it an environment issue or a Ray-related issue?).
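One quick way to check that (a standalone sketch, pure PyTorch with a made-up Atari-sized network, no Ray involved) is to time inference and a backward pass directly on the GPU:

import time
import torch
import torch.nn as nn

# Stand-in conv net roughly the size of an Atari DQN torso (84x84x4 input).
model = nn.Sequential(
    nn.Conv2d(4, 32, 8, stride=4), nn.ReLU(),
    nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
    nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
    nn.Linear(512, 6),
).cuda()
obs = torch.randn(512, 4, 84, 84, device="cuda")

def timed_ms(fn, iters=50):
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.time() - start) / iters * 1000

with torch.no_grad():
    print("inference ms per batch:", timed_ms(lambda: model(obs)))

def train_step():
    model.zero_grad()
    loss = model(obs).pow(2).mean()
    loss.backward()

print("forward+backward ms per batch:", timed_ms(train_step))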
Any progress with that? Can someone share existing code? I'd gladly help here.
@ericl, I saw you ran:
rllib train --env=Pong-v0 --run=DQN --config='{"use_pytorch": true, ...
But I couldn't find any DQN implementation for PyTorch in 0.7.2. Is it a new thing in 0.8?
I'll take this one. We agreed on unifying things in the Policy realm a little, so that Agents no longer need to care about which backend is used, and thus avoid "almost-duplicate" code. I've done some preliminary work on PG and it looks OK. Will do DQN next.
Working on this now. Preliminary tests look good. Expect this to be fully functional within a week to 10 days.
NOTE: This will include the parameter-noise exploration option, but may not include the distributional head and noisy layers (those will be added after the aforementioned initial version).
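For the noisy-layers piece, if/when it is added, the standard approach is the factorised-Gaussian NoisyNet layer from the Rainbow paper. A generic sketch (not necessarily what the RLlib version will look like):

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Factorised-Gaussian noisy linear layer (Fortunato et al., 2017)."""

    def __init__(self, in_features, out_features, sigma0=0.5):
        super().__init__()
        self.in_features, self.out_features = in_features, out_features
        self.weight_mu = nn.Parameter(torch.empty(out_features, in_features))
        self.weight_sigma = nn.Parameter(torch.empty(out_features, in_features))
        self.bias_mu = nn.Parameter(torch.empty(out_features))
        self.bias_sigma = nn.Parameter(torch.empty(out_features))
        self.register_buffer("weight_eps", torch.zeros(out_features, in_features))
        self.register_buffer("bias_eps", torch.zeros(out_features))
        bound = 1.0 / math.sqrt(in_features)
        self.weight_mu.data.uniform_(-bound, bound)
        self.bias_mu.data.uniform_(-bound, bound)
        self.weight_sigma.data.fill_(sigma0 / math.sqrt(in_features))
        self.bias_sigma.data.fill_(sigma0 / math.sqrt(in_features))
        self.reset_noise()

    @staticmethod
    def _scale_noise(size, device):
        # Factorised noise transform: sign(x) * sqrt(|x|).
        x = torch.randn(size, device=device)
        return x.sign() * x.abs().sqrt()

    def reset_noise(self):
        eps_in = self._scale_noise(self.in_features, self.weight_mu.device)
        eps_out = self._scale_noise(self.out_features, self.weight_mu.device)
        self.weight_eps.copy_(eps_out.unsqueeze(1) * eps_in.unsqueeze(0))
        self.bias_eps.copy_(eps_out)

    def forward(self, x):
        if self.training:
            weight = self.weight_mu + self.weight_sigma * self.weight_eps
            bias = self.bias_mu + self.bias_sigma * self.bias_eps
        else:
            weight, bias = self.weight_mu, self.bias_mu
        return F.linear(x, weight, bias)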
Sorry, a little late, but here we go.
https://github.com/ray-project/ray/pull/7597
I'm closing this issue now.