I have multiple game-playing agents hooked up to a model that spits out both their next move and a vector of symbols to 'communicate' with their fellow agents. I plan to build out a custom policy that calculates an intrinsic reward based on the interplay between the actions taken this timestep and the symbols received last timestep.
What I'm struggling with is the right way to persist this bag of communication vectors; while calculating the reward for a given agent I'd need the communication vectors passed around from the last timestep.
I've been considering adding the symbol emission to my action space so my environment's step function can hold all the vectors (possibly in prev_actions); alternatively, it seems like one could use callbacks such as on_episode_start to hold the required data. I'm not sure what the best practice for this kind of data-passing would be.
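Roughly, the action-space version I had in mind looks something like this (NUM_MOVES and SYMBOL_DIM are just placeholders for my actual game):

```python
import numpy as np
from gym.spaces import Box, Dict, Discrete

NUM_MOVES = 5    # placeholder for the real move set
SYMBOL_DIM = 8   # length of the communication vector

# Fold the symbol emission into the action space so it travels through
# the normal action pipeline (and would show up in prev_actions / the batch).
action_space = Dict({
    "move": Discrete(NUM_MOVES),
    "symbols": Box(low=0.0, high=1.0, shape=(SYMBOL_DIM,), dtype=np.float32),
})
```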
Can it be emitted as an action and included as part of the observation of agents in the next timestep? The env would have to do this internally.
For calculating the rewards, it sounds like you can do it in the env as usual if you save the last action/symbols, or it could also be done in an on_postprocess_traj callback, where you have the opportunity to rewrite the entire rollout sequence if needed.
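As a rough sketch of the env-side approach (the agent names, symbol sizes, and Dict-style action layout are just assumptions for illustration), the env can cache the symbols emitted this step, feed them back as part of everyone's observation on the next step, and return a per-step reward breakdown through the info dicts:

```python
import numpy as np
from ray.rllib.env.multi_agent_env import MultiAgentEnv

class CommEnv(MultiAgentEnv):
    """Sketch: symbols emitted as part of each action are echoed back
    into every agent's observation on the *next* timestep."""

    def __init__(self, config):
        self.agents = ["agent_0", "agent_1"]
        self.symbol_dim = config.get("symbol_dim", 8)
        self.last_symbols = {}

    def reset(self):
        self.last_symbols = {
            a: np.zeros(self.symbol_dim, dtype=np.float32) for a in self.agents}
        return {a: self._obs(a) for a in self.agents}

    def _obs(self, agent):
        game_state = np.zeros(4, dtype=np.float32)  # placeholder game state
        # Append the symbols the *other* agents emitted last step.
        received = np.concatenate(
            [self.last_symbols[o] for o in self.agents if o != agent])
        return np.concatenate([game_state, received])

    def step(self, action_dict):
        rewards, infos = {}, {}
        for agent, action in action_dict.items():
            env_reward = 0.0  # placeholder for the real game reward
            # The intrinsic term can use this action plus the symbols received
            # last step, which are still sitting in self.last_symbols.
            intrinsic = 0.0
            rewards[agent] = env_reward + intrinsic
            infos[agent] = {"environmental_reward": env_reward,
                            "intrinsic_reward": intrinsic}
        # Cache this step's emissions so they appear in the next observations.
        for agent, action in action_dict.items():
            self.last_symbols[agent] = np.asarray(action["symbols"], dtype=np.float32)
        obs = {a: self._obs(a) for a in self.agents}
        dones = {"__all__": False}
        return obs, rewards, dones, infos
```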
Including the symbols as an observation had not occurred to me; I'll play around with encoding the communication as an action.
Regarding intrinsic rewards, I was planning to hard-code them into the policy to keep them separate from the environmental reward, as the collective environmental reward is the final 'score.' Is there a better way to differentiate between the two types of reward for reporting purposes?
Currently we do not really have a "shared" reward for the environment; that would have to be added individually to each agent's rewards by the env. Reporting-wise, episode_reward corresponds to the sum of all agent rewards in the episode, and policy_rewards to the rewards seen by each type of policy controlling an agent.
Assigning intrinsic rewards in a callback sounds like a good option, assuming they cannot be calculated purely in the env and require access to the policy's variables.
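If you did end up needing the callback route, a rough sketch with the dict-style callbacks config would be something like the following (the "post_batch" key is from the older callbacks API and may differ between versions; the intrinsic calculation itself is left as a placeholder):

```python
import numpy as np

def on_postprocess_traj(info):
    batch = info["post_batch"]  # this agent's postprocessed trajectory (a SampleBatch)
    # Placeholder per-timestep bonus; replace with a real calculation over
    # batch["obs"] / batch["actions"] (or the policy's own variables).
    bonus = np.zeros_like(batch["rewards"])
    batch["rewards"] = batch["rewards"] + bonus
    # Note: if advantages were already computed during postprocessing,
    # they may need to be recomputed after changing the rewards.

config = {
    "callbacks": {
        "on_postprocess_traj": on_postprocess_traj,
    },
}
```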
Intrinsic reward can be calculated purely in the env here if the symbols from the last timestep are available as observations.
So the env computes total_reward = environmental_reward + intrinsic_reward, but for reporting purposes I would want access to a breakdown such as episode_total_reward, episode_environmental_reward, episode_intrinsic_reward. I'm a bit fuzzy on the best way to achieve that kind of breakdown.
I see, I think that would probably be best recorded as a custom metric. There are a few ways to do it, but the env could return these reward breakdowns in the info return from the env, and the callback can retrieve them from the rollout batch in on_postprocess_traj: https://ray.readthedocs.io/en/latest/rllib-training.html#callbacks-and-custom-metrics
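For example, something along these lines (assuming the env puts "environmental_reward" / "intrinsic_reward" into each agent's info dict as in the earlier sketch; the "episode"/"post_batch"/"infos" keys are from the older callbacks API and may vary by version):

```python
def on_postprocess_traj(info):
    episode = info["episode"]
    batch = info["post_batch"]
    # Sum up the per-step breakdown the env wrote into the info dicts.
    env_r = sum(i.get("environmental_reward", 0.0) for i in batch["infos"])
    int_r = sum(i.get("intrinsic_reward", 0.0) for i in batch["infos"])
    # Accumulate across the agents/batches belonging to the same episode.
    episode.custom_metrics["episode_environmental_reward"] = \
        episode.custom_metrics.get("episode_environmental_reward", 0.0) + env_r
    episode.custom_metrics["episode_intrinsic_reward"] = \
        episode.custom_metrics.get("episode_intrinsic_reward", 0.0) + int_r
    # The total is already reported as the built-in episode_reward.
```

The custom metrics then show up in the training results alongside episode_reward, aggregated over the episodes in each reporting window.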