Stable-baselines: PPO2 MlpLnLstm taking exponentially longer between updates

Created on 14 Nov 2019 · 20 Comments · Source: hill-a/stable-baselines

I am training a PPO2 agent with the MlpLnLstm policy on $10^6$ samples with 7 columns (features), for a total of 7 million 32-bit floats: a relatively small dataset.

My hyperparams are

n_steps: 1024,
gamma: 0.999,
learning_rate: 0.0005,
ent_coef: 0.04,
vf_coef: 0.6,
cliprange: 0.25,
noptepochs: 4,
lam: 0.85,
nminibatches: 1

(everything else (including network architecture) is left as default)

and the hardware is

CPU: AMD Ryzen Threadripper 12 core (24 CPU)
GPU: EVGA (NVIDIA) RTX 2070 Super
RAM: Corsair Vengeance 32 GB (2 x 16 GB)

I am using 24 actors in parallel to utilise all 24 CPUs when training.

I find that when training the model, there is massive overhead between batch updates, and it increases exponentially with every update: when n_updates was between 1 and 10 it took about 10 seconds between updates, when n_updates was around 180 it took 21 minutes, and at n_updates = 205 it is taking 88 minutes (with the hyperparams and actors set above, we get a total of around 1050 updates). While an update is taking place, I see the GPU crank up and finish the update very quickly (about 5 seconds). But between updates GPU utilization is at 0%, while usage on all CPUs oscillates like a sine wave between 20% and 80%.

I would like to better understand how the CPU and GPU are being utilized by stable-baselines.

1) Why is there so much CPU overhead between updates? My (custom) gym environment is very simple; the observations are taken directly from the raw data (a csv file stored in a pandas dataframe), and no transformation or calculation is applied to observations at each step. Also, the reward is very quick to calculate (only $10^{-4}$ s on an array of $10^5$ elements, as I am using numpy; this array grows by 1 with every serial timestep). At first I thought that the untrained agent would "die" (done=True) a lot in the first few iterations, which would make the time between updates very short. But even when the agent stays "alive" for more steps in later iterations, 88 minutes between updates seems far too long. How can each actor (each CPU) stepping through the environment for 1024 steps take so long?

2) Is there some hidden "minimum loss decrease" parameter? If the algorithm only updates when some minimum loss change between updates is achieved, then maybe that could be why it takes so long to update. Something analogous would be the min_delta argument of the Keras EarlyStopping callback. If this is the case, can we change this through some PPO2 parameter or kwarg?

3) Increase the architecture size. Issue #308 seems to suggest that increasing the network architecture would at least increase the GPU utilization at update time (which may also help convergence). I am open to trying this, but I don't think it really answers my first question above.

My ultimate goal is just to have a fully trained agent that traversed the entire dataset of 1 million samples in a reasonable amount of time (even 2-3 days). Any changes I can make to my gym environment / PPO2 params or any explanation on how stable baselines utilises the hardware would be much appreciated. Thanks

Labels: RTFM, custom gym env, question

ktattan

All 20 comments

It sounds a bit like trying to backpropagate too much back in time with LSTMs, but that should not happen with stable-baselines. It is hard to say only based on this what could be wrong.

Note that this is not a place for tech support, but feel free to share more information so we can dig deeper if there is a bug somewhere :)

Miffyli on 14 Nov 2019

Sure, I appreciate it's not tech support here. I am really just trying to understand the workings of stable-baselines so that I can change my code and speed things up.

I don't think there's a bug anywhere (although it's a possibility), but a better understanding of the package would help me tweak my model training setup.

What more information can I give you to help you out?

ktattan on 14 Nov 2019

Ideally this would be minimal code that shows how this happens, and since it is a custom environment it also needs to show the observation/action spaces and how the observations are created. I cannot run code comfortably right now, so my comments are rather superficial, mind you.

Also, regarding your second question (on minimum loss decrease): PPO2 only does a fixed number of updates on the gathered dataset (noptepochs passes over the gathered set of n_steps samples per environment). There is no loss-based stopping criterion.
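
To make the fixed schedule concrete, here is a rough sketch of the arithmetic. It assumes that each update consumes n_envs * n_steps samples and that the number of updates is total_timesteps integer-divided by that batch size, which is my reading of the PPO2 code rather than something stated in this thread. The helper name ppo2_schedule is hypothetical.

```python
def ppo2_schedule(total_timesteps, n_envs, n_steps, noptepochs, nminibatches):
    """Return (samples per update, number of updates, gradient steps per update)."""
    n_batch = n_envs * n_steps                 # samples gathered before each update
    n_updates = total_timesteps // n_batch     # fixed in advance, independent of the loss
    grad_steps_per_update = noptepochs * nminibatches
    return n_batch, n_updates, grad_steps_per_update


# Settings from the question: 24 actors, 10**6 rows per actor, n_steps=1024.
n_batch, n_updates, grad_steps = ppo2_schedule(
    total_timesteps=24 * 10**6, n_envs=24, n_steps=1024, noptepochs=4, nminibatches=1
)
print(n_batch, n_updates, grad_steps)
```

With the settings from the question this gives on the order of a thousand updates, roughly in line with the figure of around 1050 quoted there, and none of them waits on a loss threshold.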

Miffyli on 14 Nov 2019

I'll try to give you a minimal code example of my custom env below. I've left a lot out, but this includes the most important parts.

def __init__(self, ...):
    self.action_space = gym.spaces.Discrete(21)
    self.data = data  # dataframe with 1 million rows, 7 columns (read in via pandas.read_csv())
    self.observation_space = gym.spaces.Box(low=0., high=1., shape=(self.data.shape[1],), dtype=np.float32)  # shape must be a tuple
    self.initial_balance = 10000

....

def _next_observation(self):
    obs = self.data[self.current_step]
    return obs

....

def _reward(self):
    arr = np.diff(self.net_worth) / self.net_worth[:-1]  # returns decimal percentage
    reward = arr[-1]   # take last value
    return reward

....

def _done(self):
    # Episode ends once net worth drops to 20% of the initial balance or below
    return self.net_worth[-1] <= self.initial_balance * 0.2

....

def step(self, action):
    # _take_action() is a simple function that converts integer action to BUY amount or SELL amount in percentage
    # and also appends to the self.net_worth array
    self._take_action(action)  
    self.current_step += 1
    obs = self._next_observation()
    reward = self._reward()
    done = self._done()
    return obs, reward, done, {}  # gym expects (obs, reward, done, info)

I also have a reset function that simply resets everything to 0.
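
One thing worth checking in the _reward snippet above: np.diff runs over the whole net_worth history on every step, so its cost grows with the length of the episode even though only the last element is used. Below is a minimal sketch of a constant-time alternative, assuming net_worth is a sequence with at least two entries. The function names are hypothetical, and this is not the original environment code; whether this accounts for the slowdown reported in the thread is not established.

```python
import numpy as np


def reward_full_history(net_worth):
    """Reward as computed in the post: O(len(net_worth)) work per step."""
    arr = np.diff(net_worth) / np.asarray(net_worth)[:-1]
    return arr[-1]


def reward_last_step(net_worth):
    """Same value from the last two entries only: O(1) work per step."""
    prev, curr = net_worth[-2], net_worth[-1]
    return (curr - prev) / prev


history = [10000.0, 10100.0, 9999.0]
assert np.isclose(reward_full_history(history), reward_last_step(history))
```

If net_worth is a numpy array extended with np.append, each append also copies the whole array; a Python list or a preallocated array avoids that cost.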

ktattan on 14 Nov 2019

Hmm, at a glance this environment seems alright and should work fine. Have you tried using the "MlpLstm" policy rather than the "MlpLnLstm" policy? This one has worked as expected for me in my experiments. Other than that, I have no suggestions beyond starting to debug timings and trying to pin down what takes so long :/

Miffyli on 14 Nov 2019

Thanks for the tip. I'll give MlpLstm a try and also insert some timings and print statements in different places to see where it's taking so long. Stupid question, but is there somewhere I can put such timings and print statements in the environment so that they just print out in between batch updates rather than having thousands of print statements for every serial timestep? Perhaps there's a simple PPO2 callback API I could use?

ktattan on 14 Nov 2019

Yes, you can use this example for better monitoring. However, I would go inside the PPO2 code and start debugging there to find what is slowing things down. You could also gather timings inside the environment step to see if, for some reason, the agent is gathering samples more slowly.

If MlpLstm also does not work, try vanilla "Mlp" without LSTM. This won't solve the problem but at least we can find out if the issue is coming from code related to recurrence.
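
The suggestion above to gather timings inside the environment step, without printing on every timestep, could be sketched roughly as follows. StepTimer is a hypothetical helper, not part of stable-baselines; it duck-types any object with reset() and step() and reports the mean step time once every report_every steps. With a real gym environment, subclassing gym.Wrapper would be the more conventional route.

```python
import time


class StepTimer:
    """Wrap an env-like object and report the mean step time every `report_every` steps."""

    def __init__(self, env, report_every=1024, report=print):
        self.env = env
        self.report_every = report_every
        self.report = report
        self.total_steps = 0
        self._elapsed = 0.0
        self._count = 0

    def reset(self, *args, **kwargs):
        return self.env.reset(*args, **kwargs)

    def step(self, action):
        start = time.perf_counter()
        result = self.env.step(action)
        self._elapsed += time.perf_counter() - start
        self._count += 1
        self.total_steps += 1
        if self._count == self.report_every:
            mean_ms = 1000.0 * self._elapsed / self._count
            self.report(f"steps={self.total_steps} mean_step_ms={mean_ms:.3f}")
            self._elapsed, self._count = 0.0, 0
        return result

    def __getattr__(self, name):
        # Forward everything else (action_space, observation_space, seed, ...)
        # to the wrapped env; guard "env" itself to avoid infinite recursion.
        if name == "env":
            raise AttributeError(name)
        return getattr(self.env, name)
```

Setting report_every to n_steps gives one line per environment per batch, and a mean step time that climbs from one report to the next would point at the environment rather than at the PPO2 update.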

Miffyli on 14 Nov 2019

Yes, I remember trying to use the callbacks before, but for some reason it errored out when I trained the model in parallel using multiple CPUs. From the Callback documentation, I don't see any extra params I need to pass or precautions I need to take when training in parallel, unless you can advise otherwise.

ktattan on 14 Nov 2019

Hmm those callbacks should work fine even with multiple workers, but I have not used MPI-based algorithms in a while. Callbacks with PPO2 and multiple workers should work as expected.

But regardless, I'd start by trying the different policies and then seeing if these slow down gathering of samples from environment. The fact that updates take few seconds sounds about right, but the rest does not.

Miffyli on 15 Nov 2019

Sure, I'll try and use the callbacks again anyway.

After trying the "MlpLstm" policy, I found that it is still taking longer and longer between updates (>20min). I guess I need to dig into my environment code and PPO2 code some more to understand how it could possibly be taking so long to gather samples.

ktattan on 15 Nov 2019

@Miffyli by any chance, does having a small reward ( $\sim 10^{-4}$ ) at each step slow down convergence? Should rewards be in some certain range for faster iteration / convergence, say on the order of $10^0$ to $10^3$ ? Given my environment above, how long would you expect between updates? I timed stepping through my environment for n_steps=1024 on one CPU and appending the reward to a list at each stage and this only took about 10ms, so this may suggest that something is happening with PPO2 code that is slowing things down?

ktattan on 17 Nov 2019

Because of how networks work you should keep your advantages/returns in a "comfortable" range (e.g. [-1, 1]). For PPO2 the policy is updated with normalized advantages by default, but value network still aims to predict these small-value returns, which _could_ be unstable. This all is just a hunch, though, and the slower processing should not stem from this.
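
As a rough illustration of the scale involved, assuming per-step rewards of magnitude about $10^{-4}$ (from the question) and gamma = 0.999 (from the posted hyperparameters), the geometric-series bound gives the largest discounted return the value network could be asked to predict. The function name is hypothetical.

```python
def discounted_return_bound(max_abs_reward, gamma):
    """Upper bound on |discounted return| when every |reward| <= max_abs_reward."""
    # Sum of max_abs_reward * gamma**k over k >= 0 is max_abs_reward / (1 - gamma).
    return max_abs_reward / (1.0 - gamma)


bound = discounted_return_bound(1e-4, 0.999)
print(f"{bound:.3f}")
```

Under these assumptions the returns stay at or below roughly 0.1 in magnitude, so the value targets are small but not vanishingly so.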

For your environment things should be pretty quick (you should not have to wait for hours to get results). Have you tried MlpPolicy yet? I do not expect it to function well for your task, but if that works fine then something is wonky with recurrent policies. You could also monitor CPU and memory, specifically if the memory usage is growing as the code gets slower and slower.

Miffyli on 17 Nov 2019

So I tried the MlpLstm and that certainly sped up the initial setup time; originally for MlpLnLstm I had to wait 40 minutes before even the first update box (when verbose=1) was printed and in addition memory usage was up to 80% (and stayed there for the duration of training), now with MlpLstm it sets up in about 60 seconds and memory usage is only at 40% (and again, stays there for the duration of training). So memory usage doesn't grow as the code gets slower, it just reaches the maximum of 80% or 40% and stays there.

I haven't tried MlpPolicy yet, but I'll try now and let you know.

ktattan on 17 Nov 2019

MlpPolicy had the same slowing performance unfortunately...

One thought I had is that this could be an issue with how SubprocVecEnv is implemented. Even if I reduce the number of CPUs to 2 (instead of the available 24), after some 50 updates CPU usage suddenly jumps from 10% utilisation to 100% (for all CPUs!). Is this expected behaviour? I would have thought that just 2 CPUs would be at 100% while the others would stay down close to 1-10%. Using fewer CPUs certainly does shorten the time between updates (40-50 seconds), but I can see it is still getting generally slower with every new update (about 1-2 seconds slower every 1-2 updates). By update 450, it's taking 150 seconds between updates.

For reference, I used this example to set up my parallel environment

import numpy as np
import pandas as pd
import os

from stable_baselines.common.policies import MlpLstmPolicy
from stable_baselines.common.vec_env import SubprocVecEnv
from stable_baselines import PPO2
from stable_baselines.common import set_global_seeds

from my_env import MyEnv

data = pd.read_csv("data.csv")
N_ACTORS = 2  # number of CPU's to use - originally set to os.cpu_count()
N_TIMESTEPS = N_ACTORS * data.shape[0]


def make_env(data, rank=0, seed=0):
    def _init():
        env = MyEnv(data=data)
        env.seed(seed + rank)
        return env
    set_global_seeds(seed)
    return _init

def train():
    train_env = SubprocVecEnv([make_env(data=data, rank=i) for i in range(N_ACTORS)])
    model = PPO2(MlpLstmPolicy, train_env, verbose=1)
    model.learn(total_timesteps=N_TIMESTEPS)
    model_path = "agents/agent_0.pkl"
    os.makedirs(os.path.dirname(model_path), exist_ok=True)  # make sure the output directory exists
    model.save(model_path)


if __name__ == "__main__":
    train()

ktattan on 17 Nov 2019

That behaviour does not sound normal at all, especially if you are using GPU with stable-baselines (it should utilize GPU in spikes for training and CPU only for environments).

Two things pop to my mind:
1) You seem to share the same data object with all workers. I am not too familiar with Pandas to know what this object is exactly, but it could end up being shared in a wonky way. I would move reading dataset just before env = MyEnv(data=data), just to make sure this is not breaking things.
2) Try DummyVecEnv instead of SubprocVecEnv. Since your environment is computationally very simple, using different Python processes (SubprocVecEnv) adds considerable overhead. See note here for more info.

Miffyli on 18 Nov 2019

Thanks again for the help. I'll implement your suggestions and test again.

Can I just clarify a few things.

By your first point, do you mean

def make_env(rank=0, seed=0):
    def _init():
        data = pd.read_csv("data.csv")
        env = MyEnv(data=data)
        env.seed(seed + rank)
        return env
    set_global_seeds(seed)
    return _init

Won't that create a duplicate dataframe for each environment, thereby using far more memory? (P.S. In reality I am passing a numpy array into the environment, but I first read in the data using pandas, then convert to numpy.)

Actually all I want to do is train multiple agents in parallel using all my CPU's. I see that DummyVecEnv "Creates a simple vectorized wrapper for multiple environments" - is it possible to set up an environment using DummyVecEnv but to be run in parallel to collect multiple samples simultaneously like SubprocVecEnv?

ktattan on 18 Nov 2019

1) Yes, and yes, it will create multiple instances of the dataset in memory, but with subprocesses (SubprocVecEnv) this was probably already happening behind the scenes, depending on how Pandas dataframes etc. work.

2) If you want to train multiple agents in parallel, then you need to create different instances of the PPO2/A2C/etc. algorithms and call learn() on each of them. These vectorized environments just speed up / stabilize learning by gathering samples faster. Even though DummyVecEnv does not use actual multi-processing to run truly parallel environments, it can be faster because it avoids the overhead of communicating between processes that SubprocVecEnv incurs.

Miffyli on 19 Nov 2019

Really appreciate all the help @Miffyli

Having timed several parts of the ppo2 code, I isolated that 99% of the time comes from stepping through the environment here - all other lines in the Runner class take less than 1% of the time.

With this in mind, I started to time each of the individual functions within my custom environment and found that most functions were taking longer with each new batch update. However, I don't fully understand why this would be so. The minimal code example I posted above includes most of the code without getting too complicated; I have excluded some list appending, and I also use scikit-learn min-max scaling at every step, but all these functions are very fast if timed in isolation.

The only explanation I can think of is that some list or object is not reset with every batch update and keeps growing in size. But I would expect that to show up in the memory usage, and my memory usage stays constant over time.

I couldn't find where the environment is being reset in the PPO2 code - does it get reset with every new batch update? Could you maybe explain / point out how the environment gets reset between updates and perhaps if there is any object / data that keeps increasing in size in the updates for loop?

ktattan on 21 Nov 2019

"I couldn't find where the environment is being reset in the PPO2 code"

This is done automatically when using VecEnv (cf doc).
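
A minimal sketch of what that automatic reset amounts to, written as plain Python rather than the actual VecEnv code: each sub-environment is reset only when it reports done=True, not at the boundary between PPO2 updates. CountdownEnv and vec_step are hypothetical names used for illustration.

```python
class CountdownEnv:
    """Toy env (hypothetical): the episode ends after `length` steps."""

    def __init__(self, length):
        self.length = length
        self.t = 0
        self.n_resets = 0

    def reset(self):
        self.t = 0
        self.n_resets += 1
        return self.t

    def step(self, action):
        self.t += 1
        done = self.t >= self.length
        return self.t, 0.0, done, {}


def vec_step(envs, actions):
    """Step every env once; reset an env only when it reports done=True."""
    results = []
    for env, action in zip(envs, actions):
        obs, reward, done, info = env.step(action)
        if done:
            obs = env.reset()  # the returned observation starts the next episode
        results.append((obs, reward, done, info))
    return results


envs = [CountdownEnv(3), CountdownEnv(1000)]
for env in envs:
    env.reset()
for _ in range(6):
    vec_step(envs, [0, 0])
print([env.n_resets for env in envs])
```

An environment that rarely reports done therefore keeps its internal state, for example a growing net_worth history, across many updates.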

This sounds like a problem with your environment. You should debug it using a random agent, i.e. actions drawn with env.action_space.sample() (cf. #536). Note that, as @Miffyli mentioned before, we do not do tech support, so please ping us again only if you think this comes from Stable-Baselines.
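
The random-agent check suggested above could look roughly like the sketch below. It takes PPO2 out of the loop and records how long each chunk of steps takes, so a per-step cost that grows over time becomes visible. GrowingHistoryEnv is a hypothetical toy that deliberately does O(n) work on its history; with a real gym environment, env.sample_action() would be replaced by env.action_space.sample().

```python
import random
import time


class GrowingHistoryEnv:
    """Toy env (hypothetical) whose per-step cost grows with its history length."""

    def __init__(self):
        self.history = []

    def sample_action(self):
        # Stand-in for env.action_space.sample() on a Discrete(21) space.
        return random.randrange(21)

    def reset(self):
        self.history = [1.0]
        return 0.0

    def step(self, action):
        self.history.append(self.history[-1] + action)
        mean = sum(self.history) / len(self.history)  # O(n) work on the full history
        return mean, 0.0, False, {}


def time_random_rollout(env, n_steps, chunk):
    """Run a random agent and return the wall-clock seconds spent in each chunk of steps."""
    env.reset()
    durations = []
    start = time.perf_counter()
    for i in range(1, n_steps + 1):
        _, _, done, _ = env.step(env.sample_action())
        if done:
            env.reset()
        if i % chunk == 0:
            now = time.perf_counter()
            durations.append(now - start)
            start = now
    return durations


durations = time_random_rollout(GrowingHistoryEnv(), n_steps=2000, chunk=500)
print([f"{d:.4f}" for d in durations])
```

If later chunks take consistently longer than earlier ones with a random agent, the slowdown lives in the environment and not in Stable-Baselines.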