Ray: [rllib] Implement R2D2: Recurrent Experience Replay in Distributed Reinforcement Learning

Created on 28 Oct 2018 · 9 Comments · Source: ray-project/ray

Describe the problem

The results for R2D2 are quite good: https://openreview.net/forum?id=r1lyTjAqYX

We should add this as a variant of Ape-X DQN that supports recurrent networks. The high-level changes would include:

  • [ ] Wire up rnn in/out to the models of the DQN policy graph
  • [ ] Support storing and retrieving sequences from the replay buffer
  • [ ] Implement state burn-in on retrieval from the buffer (probably fast enough to do this in Python-land)
  • [ ] Add the sqrt value rescaling (page 2; see the sketches below)
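The value rescaling in the last item is the invertible transform from page 2, h(x) = sign(x)(sqrt(|x| + 1) − 1) + εx, applied to the Bellman target, with its inverse applied when bootstrapping from the target network. A minimal NumPy sketch (the ε value here is an assumption; check the paper for the value actually used):

```python
import numpy as np

EPS = 1e-3  # assumed value for the paper's epsilon

def h(x, eps=EPS):
    """Value rescaling h(x) = sign(x) * (sqrt(|x| + 1) - 1) + eps * x."""
    return np.sign(x) * (np.sqrt(np.abs(x) + 1.0) - 1.0) + eps * x

def h_inv(x, eps=EPS):
    """Closed-form inverse of h, used on the bootstrapped target Q-value."""
    return np.sign(x) * (
        ((np.sqrt(1.0 + 4.0 * eps * (np.abs(x) + 1.0 + eps)) - 1.0) / (2.0 * eps)) ** 2
        - 1.0
    )
```

For the sequence-replay and burn-in items, here is a rough sketch of one possible approach (all class and method names are hypothetical, not RLlib APIs): store fixed-length sequences together with the recurrent state recorded at their first step, and on retrieval unroll the network over a burn-in prefix purely to refresh that (now stale) state before computing the loss on the remaining steps.

```python
import random
from collections import deque

class SequenceReplayBuffer:
    """Hypothetical buffer that stores fixed-length transition sequences
    along with the RNN state observed at the start of each sequence."""

    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)

    def add(self, sequence, initial_rnn_state):
        # `sequence`: list of (obs, action, reward, done) of length burn_in + train_len
        self.storage.append((sequence, initial_rnn_state))

    def sample(self, batch_size):
        return random.sample(self.storage, batch_size)


def burn_in(rnn_model, sequence, initial_rnn_state, burn_in_len):
    """Unroll the current network over the burn-in prefix, keeping only the
    hidden state, and return it together with the steps used for the update."""
    state = initial_rnn_state
    for obs, action, reward, done in sequence[:burn_in_len]:
        _, state = rnn_model.forward(obs, state)  # outputs discarded; only state kept
    return state, sequence[burn_in_len:]
```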
Labels: P3, good first issue, rllib

All 9 comments

Currently working on this (for some reason I can't assign the issue to myself?)

Yeah you need to be part of the org, but it's probably ok - if you open a PR people will know

For todo (1), "Wire up rnn in/out to the models of the DQN policy graph", isn't this just specifying "use_lstm": True in the model config?

Currently, the "use_lstm": True option isn't supported by DQN, so a bit of work is needed to allow using it.
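For reference, other RLlib algorithms enable recurrence purely through the model config, so the eventual interface would presumably look something like the snippet below (a guess at how it might be configured, not something DQN accepts today; "use_lstm" and "max_seq_len" are existing model options):

```python
config = {
    "model": {
        "use_lstm": True,   # wrap the Q-network in an LSTM
        "max_seq_len": 80,  # length of the stored/replayed sequences
    },
    # ... plus the usual Ape-X / DQN settings (replay, n-step, prioritization)
}
```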

any progress here?

Ah not yet; no one is currently working on this. Open to contributions!

Looking at the results in the paper, there is a surprisingly large performance gap between the fully recurrent R2D2 and its feed-forward variant, considering that most Atari envs are essentially MDPs.
They claim in the paper that on these envs the recurrent agent "learns better representations".

Could it be partly because the LSTM agent simply has many more parameters? They never mention whether they keep the number of weights the same in the feed-forward ablation or just remove the LSTM layer.
They also say that the LSTM agent receives the previous action and reward as input to the RNN layer, but never mention whether the feed-forward agent gets the same treatment.

Finally, citing the paper:
[RNN] improves performance even on domains that are fully observable and do not obviously require memory (cf. BREAKOUT results in the feed-forward ablation).
But if you look at the full results on page 18, their recurrent agent is actually worse on Breakout than the previous feed-forward SOTA!

I wonder whether their claim that the LSTM is the main performance factor is actually correct.

Is there a good reason why R2D2 should be a Q-learning algorithm? I am approaching this from the viewpoint of someone who wants the most sample efficiency out of their algorithms. I understand that experience replay does not tend to mix well with actor-critic algorithms, because experience gathered by policy-based algorithms tends to become stale quickly.

However, there are fixes for this, such as soft actor-critic or ACER, though I am not sure how competitive the latter is at present; it was already barely competitive with prioritised DDQN when it first came out. With regard to the former, I still don't understand the differences between energy-based policies, PGQL, and, for instance, normalised actor-critic, which seem to repackage the same idea in various forms. I have yet to understand their relative advantages and shortcomings.

It is also quite odd that the average experience utilisation of R2D2 is 80% while Ape-X's ratio is something like 130%. I wish there were more information about the distribution of rollouts consumed: if, because of prioritisation, certain rollouts are consumed tens or hundreds of times, that would mean a majority of rollouts are essentially tossed out and never seen by the optimisation loop, but also that prioritisation is (perhaps) doing its job.

Is this issue still open? If so, can I work on it?
