Keras: Mask, hidden states and cost functions for sequence to sequence learning

Created on 23 Oct 2015 · 13 Comments · Source: keras-team/keras

The way the mask affects the cost function right now is not ideal. Note that the mask comes from the input and multiplies the final cost. This is right when we are doing sequence prediction, where the input always has the same length as the output. But for text translation, question answering, and other applications, the desired sequence does not always have the same length as the input.

I'm working to rewrite the cost function API in this PR https://github.com/fchollet/keras/pull/802. But I'd like to hear opinions on what the best approach would be here. Right now, I'm planning to take the mask out of the cost function completely, since sample_weight can do that for us. The thing is, right now we do not provide an automatic way to generate sample_weight, AFAIK.

Another problem: the mask, as implemented right now, is not stateful-friendly (not even adaptive-first-state friendly):

self.activation(x_t + mask_tm1 * T.dot(h_tm1, u))

This resets the hidden state to zero instead of using the previous value. I suggest doing:

hh_t = self.activation(x_t + T.dot(h_tm1, u))
h_t = mask * hh_t + (1 - mask) * h_tm1

Note that this solution reduces to the previous one when not using statefulness or when initializing the first states to zero. In this case the mask always has to be binary, which I believe is already the case.
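For example, here is a minimal numpy sketch of that step (a sketch only; it assumes x_t already contains the input projection and that mask_t is a binary value per sample):

import numpy as np

def masked_step(x_t, mask_t, h_tm1, u, activation=np.tanh):
    hh_t = activation(x_t + np.dot(h_tm1, u))        # candidate new hidden state
    mask_t = mask_t[:, None]                         # broadcast over the hidden dimension
    return mask_t * hh_t + (1.0 - mask_t) * h_tm1    # keep the previous state where masked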

In conclusion, we have two problems to solve:

  • [ ] Take the mask out of the cost function and provide a solution based on sample_weight
  • [ ] Rewrite the Recurrent layers to use the previous hidden state instead of resetting it.

Let me know what you think.


All 13 comments

Right now, I'm planning to take mask out of the cost function completely since sample_weight can do that for us.

Good idea. I believe this is a good approach.

The thing is, right now, we do not provide an automatic way to generate sample_weight AFAIK.

What do you mean exactly?

Your proposed solution for mask looks good to me too. Until now it wasn't an issue, but I agree it needs to be fixed to allow for stateful RNNs.

What do you mean exactly?

I was thinking about a helper function to calculate the appropriate sample weights from a given desired output. But I guess this is ultimately up to the user.
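Something like the hypothetical helper below (the name and the zero-padding convention are assumptions on my part):

import numpy as np

def make_sample_weight(y, pad_value=0.):
    # y has shape (n_samples, n_timesteps, n_features) and is padded with pad_value;
    # the result has shape (n_samples, n_timesteps): 1 for real timesteps, 0 for padding.
    is_real = np.any(y != pad_value, axis=-1)
    return is_real.astype('float32')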

Also, what do you think about the idea of an adaptive initial state? Here we initialize to zero, but I usually read that people train that vector.

Having the initial state be a learnable parameter of the layer would make sense; it might improve performance. How would that interact with the statefulness feature?

Both adaptive state and statefulness require the initial state to be a shared variable.

The way I solved statefulness was using updates to carry the last state from one batch to the beginning of the next one.

self.updates = ((self.h, outputs[-1]), )  # initial state of next batch
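For illustration, a minimal Theano sketch of that idea (h0, batch_size and hidden_dim are hypothetical names, not the actual layer attributes):

import numpy as np
import theano

batch_size, hidden_dim = 32, 128  # example sizes
# Shared variable holding the initial/carried hidden state.
h0 = theano.shared(np.zeros((batch_size, hidden_dim), dtype=theano.config.floatX))

# Inside the layer, h0 would be the outputs_info passed to theano.scan:
#   outputs, _ = theano.scan(step, sequences=[x, mask], outputs_info=[h0])
# Statefulness: copy the last hidden state into h0 after every batch.
#   self.updates = ((h0, outputs[-1]),)
# Adaptive initial state: register h0 as a trainable parameter instead,
# so the optimizer learns it by gradient descent.
#   self.params.append(h0)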

To reset the states, I used callbacks:

class ResetRNNState(Callback):
    def __init__(self, h, func):
        self.h = h
        self.func = func

    def on_batch_end(self, batch, logs={}):
        if self.func(batch, logs):
            self.h.set_value(self.h.get_value()*0)

This resets the state to zero. If we want both an adaptive initial state (which will be adapted only every so often) and statefulness, we should reset h to the appropriate value calculated by the gradient updates. The callback would have to have access to that value and save it.
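A hedged sketch of that variant, reusing the callback above (h0, holding the learned initial state, is a hypothetical shared variable):

from keras.callbacks import Callback

class ResetRNNStateToLearned(Callback):
    def __init__(self, h, h0, func):
        self.h = h        # shared variable carrying the state across batches
        self.h0 = h0      # shared variable holding the learned initial state
        self.func = func  # decides when a reset should happen

    def on_batch_end(self, batch, logs={}):
        if self.func(batch, logs):
            carried = self.h.get_value()
            # broadcast the learned initial state over the batch dimension
            self.h.set_value(carried * 0 + self.h0.get_value())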

It is not as complicated as it sounds, I swear. But we could also ask the user whether she wants statefulness, an adaptive initial state, or neither.

I am also trying to do seq to seq learning with variable length sequences (text prediction).

But, for text translation, question answering and other applications, the desired sequence does not always have same length as the input.

I am also struggling with the same concern. I figured that if I was, there are probably a few other people as well, so I just want to comment that this would further improve Keras if it could be fixed!

I wish I could help in some way, but I simply need to understand more and do some reading before I can offer helpful suggestions. Just wanted to add a voice.

One thing that I find myself needing from time to time is a general way to tell the neural network to "ignore the following dimensions for this sample at the input" and "ignore the following dimensions for this sample at the output."

The obvious example is an autoencoder where some of the inputs have missing features; it would be good not to compute the costs for those particular features when evaluating the quality of the reconstruction.

So right now sample_weight is applied per sample, but it would be good to have the option to pass in a sample_weight tensor that has the same shape as the input. Not sure if this is already easy to do and I just don't know how.

@sergeyf check the PR I mentioned; I think that is how we are doing it. sample_weight will have the same dimensions as the desired output.

@EderSantana Sounds great, thanks!

Hi @EderSantana

There are many posts here about masking the cost function without the use of embedding layers, referencing various PRs, but I wonder what the correct way of doing so is based on the latest code.

Similar to what Sergey mentioned here, I have an auto-encoder network, AE below, learning to reconstruct sparse input vectors. I want the cost function, for each sample, to be a sum over the non-zero features. So, if X is the training data (X.shape = (n_samples, n_features)), the weight matrix would be the following:

sample_weights = numpy.where(X > 0, 1., 0.)
then,
AE.fit(X, X, nb_epoch=1, batch_size=batch_size, sample_weight=sample_weights, shuffle=True)

I also set sample_weight_mode='temporal' in compile, but it complains that the number of dimensions needs to be 3 while 2 was given; X = X.reshape((n_samples, n_features, 1)) didn't help either.

_weighted_objective_ seems to be the construct that should be able to work around this, but I am not sure how to make use of it in .fit().

Any tips on this will be very appreciated.

@hadi-ds I don't know exactly how to do what you want while keeping everything under the Keras API. Sample weights mask the final costs, not the outputs of a layer, I think.
Recently my models have been falling outside Keras' scope; what I have been doing is using Keras to define the feedforward pass, and doing the cost function and optimization in raw TensorFlow (or Theano), which is more flexible. I've been doing things much more in line with https://blog.keras.io/keras-as-a-simplified-interface-to-tensorflow-tutorial.html

Thanks @EderSantana
In fact, what I am trying to do with that 2D sample weight W is to mask the cost function (not the output of a layer) on a per-sample and per-feature basis. So, if C_ij is the part of the cost function associated with the j'th component of the i'th sample (so the unweighted loss is \sum_ij C_ij), the weighted cost function that I want to optimize is \sum_ij W_ij * C_ij.
Is this currently doable in Keras?

@hadi-ds I don't think so. Keras doesn't support spatial cost functions by default; you have to reshape your output layer: https://github.com/fchollet/keras/blob/master/examples/variational_autoencoder_deconv.py#L63-L69
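For illustration, a hedged sketch of that reshape workaround with the variables from the earlier comment, assuming a Keras 1.x-style API (layer and argument names may differ in older versions):

from keras.layers.core import Reshape

AE.add(Reshape((n_features, 1)))   # make the output 3D: (n_samples, n_features, 1)
AE.compile(optimizer='rmsprop', loss='mse', sample_weight_mode='temporal')

X3 = X.reshape((n_samples, n_features, 1))     # targets must match the 3D output
sample_weights = numpy.where(X > 0, 1., 0.)    # shape (n_samples, n_features)
AE.fit(X, X3, nb_epoch=1, batch_size=batch_size,
       sample_weight=sample_weights, shuffle=True)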

I figured out how to write such a cost function (with per-sample and per-feature weights W_ij) in Theano. The non-zero components of the sparse vectors my network reconstructs are in (0, 1], so I use the ceiling function to get the weight matrix I want, i.e., with
x = [0, 0, 0.6, 0, 0.97, 0.2, 0, ...] --> ceil(x) = [0, 0, 1., 0, 1., 1., 0, ...].

Based on this observation, the cost function can be written as

import theano.tensor as T

def weighted_vector_mse(y_true, y_pred):
    weight = T.ceil(y_true)  # 1 for non-zero targets, 0 for zeros
    loss = weight * T.square(y_true - y_pred)
    # use the appropriate expression for other objectives, e.g. for binary_crossentropy:
    # loss = -weight * (y_true * T.log(y_pred) + (1.0 - y_true) * T.log(1.0 - y_pred))
    return T.mean(T.sum(loss, axis=-1))
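With a custom objective like this, the sample_weight machinery is no longer needed; the model can then presumably be compiled and fit with it directly, e.g.:

AE.compile(optimizer='rmsprop', loss=weighted_vector_mse)
AE.fit(X, X, nb_epoch=1, batch_size=batch_size, shuffle=True)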