When we customize an RNN-based model, or simply set use_lstm to True, the sampled data is zero-padded to max_seq_len, as done in TFPolicyGraph._get_loss_inputs_dict.
However, when calculating the final loss, we need to exclude the padded zeros. The current PPO loss directly multiplies the padded advantages and logp_ratio: https://github.com/ray-project/ray/blob/master/python/ray/rllib/agents/ppo/ppo_policy_graph.py#L69
The summation itself is not a problem, but with reduce_mean the denominator is inflated because the padded zeros are counted as valid elements.
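As a toy illustration of the inflated denominator (numbers are made up): two sequences of lengths 3 and 1, padded to max_seq_len = 3, give a flat batch of 6 elements, 2 of which are padding:

```python
import numpy as np

# Flattened per-timestep loss; the last two entries are padding
# from the second sequence (length 1, padded to 3).
per_timestep_loss = np.array([0.5, 0.4, 0.6, 0.3, 0.0, 0.0])
seq_lens = np.array([3, 1])

naive_mean = per_timestep_loss.mean()                 # 1.8 / 6 = 0.30
true_mean = per_timestep_loss.sum() / seq_lens.sum()  # 1.8 / 4 = 0.45
```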
Is this an implementation error or an approximation? If max_seq_len is a relatively small value, I think the current implementation can be regarded as an approximation for the sake of efficiency.
Thanks for pointing this out. I agree that when max_seq_len is small the approximation is pretty close, and even if not it just scales the loss by a small factor.
This pattern seems dangerous in general though, since there could be loss functions that are intolerant of injected zeros. Two options I can think of to fix this: (1) mask out the padded elements in each loss function before the reduce_mean; (2) strip the padding from the model outputs before they reach the loss.
Any thoughts on which is preferable?
This could also be bad for the value loss, since it will tend to force the value function toward zero (though only for the zero-padded inputs).
In my own code, I adopt option 1, like this:
```python
max_seq_len = tf.reduce_max(seq_lens)
mask = tf.sequence_mask(seq_lens, max_seq_len)
mask = tf.reshape(mask, [-1])
# before reduce_mean
loss = tf.boolean_mask(loss, mask)
```
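An equivalent formulation that avoids the dynamic output shape of tf.boolean_mask is a weighted mean; this is just a sketch, not RLlib's actual code:

```python
import tensorflow as tf

def masked_mean(per_element_loss, seq_lens, max_seq_len):
    # Weight valid timesteps 1.0 and padded timesteps 0.0, then normalize
    # by the number of valid elements instead of the padded batch size.
    mask = tf.reshape(
        tf.cast(tf.sequence_mask(seq_lens, max_seq_len),
                per_element_loss.dtype),
        [-1])
    return tf.reduce_sum(per_element_loss * mask) / tf.reduce_sum(mask)
```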
As for option 2, I think it is less convenient for customized models, and harder to implement.
@llan-ml , if (2) was implemented as follows, would that address your concern for customized models?
```python
def build_layers(self, inputs, num_outputs, options):
    last_layer = add_time_dimension(inputs, self.seq_lens)  # pad input
    ... lstm cell code ...
    last_layer = tf.reshape(
        remove_time_dimension(lstm_out, self.seq_lens),  # remove padding
        [-1, cell_size])
    return ...
```
I'm thinking that the advantage of doing this is that we don't need to patch all the loss functions that support LSTMs.
I believe a function like remove_time_dimension would also make it easier to build new models.
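For concreteness, here is a minimal sketch of what such a remove_time_dimension helper could look like, assuming a batch-major [num_seqs, max_seq_len, features] LSTM output (the helper is hypothetical, not an existing RLlib API):

```python
import tensorflow as tf

def remove_time_dimension(padded_outputs, seq_lens):
    # padded_outputs: [num_seqs, max_seq_len, features]
    # Returns [sum(seq_lens), features], dropping the padded timesteps.
    max_seq_len = tf.shape(padded_outputs)[1]
    mask = tf.sequence_mask(seq_lens, max_seq_len)  # [num_seqs, max_seq_len]
    return tf.boolean_mask(padded_outputs, mask)
```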
At the same time, we need to consider which elements should be passed into chop_into_sequences (https://github.com/ray-project/ray/blob/master/python/ray/rllib/evaluation/tf_policy_graph.py#L147). Maybe we should only process the initial states and leave the other data in the batch untouched.
Given the input and seq_lens, option 2 needs to dynamically split the input into several parts. tf.sparse_to_dense cannot handle this, since its sparse_values argument must be a 0-D or 1-D Tensor.
I tried to pad zeros in TF using while_loop and TensorArray as follows:
```python
import numpy as np
import tensorflow as tf


def tf_pad_zeros(inputs, seq_lens):
    max_seq_len = tf.reduce_max(seq_lens)
    num_seqs = tf.size(seq_lens)
    tmp_array = tf.TensorArray(
        dtype=inputs.dtype, size=num_seqs, infer_shape=False)
    idx = tf.constant(0)
    # Start offset of each sequence in the flat input.
    slice_begin = tf.cumsum(seq_lens, exclusive=True)
    # Length of each sequence (this difference simply equals seq_lens).
    slice_size = tf.cumsum(seq_lens) - tf.cumsum(seq_lens, exclusive=True)

    def _pad_zeros(i, t):
        # Slice out sequence i and pad it with zeros up to max_seq_len.
        sliced = tf.slice(inputs, [slice_begin[i], 0], [slice_size[i], -1])
        pad_zeros = tf.zeros(
            [max_seq_len - slice_size[i]] + inputs.get_shape().as_list()[1:],
            dtype=inputs.dtype)
        padded = tf.concat([sliced, pad_zeros], axis=0)
        t = t.write(i, padded)
        return tf.add(i, 1), t

    _, tmp_array = tf.while_loop(
        cond=lambda i, _: tf.less(i, num_seqs),
        body=_pad_zeros,
        loop_vars=[idx, tmp_array])
    padded_inputs = tmp_array.concat()
    padded_inputs.set_shape([None] + inputs.get_shape().as_list()[1:])
    return padded_inputs


def numpy_pad_zeros(inputs, seq_lens):
    max_seq_len = np.max(seq_lens)
    f = np.array(inputs)
    f_pad = np.zeros((len(seq_lens) * max_seq_len,) + np.shape(f)[1:])
    seq_base = 0
    i = 0
    for l in seq_lens:
        # Copy the l valid rows of this sequence; the rest stay zero.
        for seq_offset in range(l):
            f_pad[seq_base + seq_offset] = f[i]
            i += 1
        seq_base += max_seq_len
    return f_pad


seq_lens = np.random.randint(10, 50, size=100, dtype=np.int32)
inputs = np.random.rand(seq_lens.sum(), 10).astype(dtype=np.float32)

sess = tf.InteractiveSession()
tf_inputs = tf.constant(inputs, dtype=tf.float32)
tf_seq_lens = tf.constant(seq_lens, dtype=tf.int32)
padded_inputs = tf_pad_zeros(tf_inputs, tf_seq_lens)


def run_tf():
    return sess.run(padded_inputs)


def run_np():
    return numpy_pad_zeros(inputs, seq_lens)


x = run_tf()
y = run_np()
assert np.array_equal(x, y)
```
```
In [1]: %timeit run_tf()
4.99 ms ± 34 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [2]: %timeit run_np()
3.3 ms ± 18.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
Ah, it looks like tf.scatter_nd() is the recommended way to do this.
```python
def tf_pad_zeros_scatter(inputs, seq_lens):
    padded_size = tf.reduce_max(seq_lens) * tf.size(seq_lens)
    # Positions of the valid (non-padded) rows within the padded output.
    indices = tf.boolean_mask(
        tf.range(padded_size),
        tf.reshape(tf.sequence_mask(seq_lens), [padded_size]))
    # Scatter the input rows into those positions; all other rows are zero.
    return tf.scatter_nd(
        tf.expand_dims(indices, 1), inputs,
        [padded_size] + inputs.get_shape().as_list()[1:])
```
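As a quick sanity check of what the scatter does (a toy example, not from the thread's benchmark):

```python
import numpy as np
import tensorflow as tf

x = np.arange(8, dtype=np.float32).reshape(4, 2)  # 4 valid rows, 2 features
padded = tf_pad_zeros_scatter(tf.constant(x), tf.constant([3, 1]))
with tf.Session() as sess:
    out = sess.run(padded)
# seq_lens = [3, 1] pads both sequences to max_seq_len = 3, so out has
# 6 rows: rows 0-2 are the first sequence, row 3 the second, rows 4-5 zeros.
print(out)
```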
This seems to be the fastest:

```
Numpy          mean latency 0.003065221309661865
TF while_loop  mean latency 0.0033680152893066405
TF scatter_nd  mean latency 0.00010882377624511719
```
Unfortunately it looks like (2) further complicates the PPO multi-gpu / minibatch code a lot (it was already complicated to handle sequence lengths correctly). This is because dividing the data into minibatches is difficult if they aren't padded to sequence length.
I am not familiar with the logic of the current multi-gpu implementation. In my understanding, we could just divide the batch into several roughly equal parts based on the seq_lens, and then optimize each part the way simple_optimizer does; a rough sketch is below.
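Something like the following could do the division outside TF (the helper name and the greedy balancing strategy are my own illustration, not RLlib code):

```python
def split_by_seq_lens(seq_lens, num_parts):
    """Greedily assign whole sequences to num_parts buckets so that the
    total number of timesteps per bucket is roughly balanced."""
    buckets = [[] for _ in range(num_parts)]
    loads = [0] * num_parts
    # Longest sequences first, each into the currently lightest bucket.
    for idx in sorted(range(len(seq_lens)), key=lambda i: -seq_lens[i]):
        j = loads.index(min(loads))
        buckets[j].append(idx)
        loads[j] += seq_lens[idx]
    return buckets  # lists of sequence indices, one per tower
```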
If I understand correctly, currently the different towers share the same set of input placeholders, and the function LocalSyncParallelOptimizer.load_data is used to split the data fed to those placeholders across the towers.
An alternative would be to split the data outside TF and define a separate set of placeholders for each replica. A side benefit is that each tower could then be optimized simultaneously. I am not sure whether this approach has other problems.
Yeah, I agree moving the splitting logic outside of TF would make the implementation a lot simpler. The existing implementation is brittle since it does so much work using tf.split() and slice indexing internally, when it could be computed ahead of time in python code.
Btw, I found a different bug in the multi-gpu optimizer for RNNs. The batch is shuffled prior to loading, which breaks up the sequences: https://github.com/ray-project/ray/pull/2996
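For context, shuffling at sequence granularity instead of per-timestep would look something like this sketch (my own illustration, not the actual fix in the PR):

```python
import numpy as np

def shuffle_sequences(rows, seq_lens, rng=np.random):
    # Permute whole sequences; timesteps within each sequence stay
    # contiguous and in order, so RNN state is not broken up.
    starts = np.concatenate([[0], np.cumsum(seq_lens)[:-1]])
    order = rng.permutation(len(seq_lens))
    index = np.concatenate(
        [np.arange(starts[i], starts[i] + seq_lens[i]) for i in order])
    return rows[index], np.asarray(seq_lens)[order]
```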
On the partially observed cartpole example: https://github.com/ray-project/ray/blob/3a3782c39fc2e75f53515660b866e7cc501b7704/python/ray/rllib/examples/cartpole_lstm.py
[Figure: learning curves on cartpole_lstm. Magenta: before fix. Blue: after fix (performance is much better, though the value function seems worse?)]
I also tried applying sequence masking before the reduce_mean ops, but it didn't really seem to affect performance (as expected). Since the padding issue doesn't seem to have much effect on actual performance, I'm inclined to leave it as is for now due to the multi-gpu optimizer complications.
Fixed in master
@ericl Could you link the corresponding PR?