Keras: Keras loss function explained

Created on 4 Feb 2018 · 8 Comments · Source: keras-team/keras

Hey there,

I stumbled across the definition of mse in Keras and I can't seem to find an explanation.

def mean_squared_error(y_true, y_pred):
    return K.mean(K.square(y_pred - y_true), axis=-1)

I was expecting the mean to be taken across the batch dimension, which is axis=0, but instead it
is taken over axis=-1.

I also played around with it a little to see if K.mean actually behaves like the numpy equivalent.
I must have misunderstood something. Can somebody please clarify?

I can't actually take a look inside the cost function at run time, right?
As far as I know, the function is called at compile time, which prevents me from
evaluating concrete values.

I mean... imagine doing regression and having a single output neuron and training with a batch size of ten.

>>> import numpy as np
>>> a = np.ones((10, 1))
>>> a
array([[ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.]])
>>> np.mean(a, axis=-1)
array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])

All it seems to do is drop the last axis (here of size 1) instead of taking the mean over all the predictions.

Most helpful comment

@Nimi42, I was also frustrated by the same question. After a bit of searching around in the source code, I believe the answer hides in this file: keras/keras/engine/training_utils.py, where you can see this snippet:

def weighted_masked_objective(fn):
    """Adds support for masking and sample-weighting to an objective function.
    It transforms an objective function `fn(y_true, y_pred)`
    into a sample-weighted, cost-masked objective function
    `fn(y_true, y_pred, weights, mask)`.
    # Arguments
        fn: The objective function to wrap,
            with signature `fn(y_true, y_pred)`.
    # Returns
        A function with signature `fn(y_true, y_pred, weights, mask)`.
    """
    if fn is None:
        return None

    def weighted(y_true, y_pred, weights, mask=None):
        """Wrapper function.
        # Arguments
            y_true: `y_true` argument of `fn`.
            y_pred: `y_pred` argument of `fn`.
            weights: Weights tensor.
            mask: Mask tensor.
        # Returns
            Scalar tensor.
        """
        # score_array has ndim >= 2
        score_array = fn(y_true, y_pred)
        if mask is not None:
            # Cast the mask to floatX to avoid float64 upcasting in Theano
            mask = K.cast(mask, K.floatx())
            # mask should have the same shape as score_array
            score_array *= mask
            #  the loss per batch should be proportional
            #  to the number of unmasked samples.
            score_array /= K.mean(mask)

        # apply sample weighting
        if weights is not None:
            # reduce score_array to same ndim as weight array
            ndim = K.ndim(score_array)
            weight_ndim = K.ndim(weights)
            score_array = K.mean(score_array,
                                 axis=list(range(weight_ndim, ndim)))
            score_array *= weights
            score_array /= K.mean(K.cast(K.not_equal(weights, 0), K.floatx()))
        return K.mean(score_array)
    return weighted

Note the final K.mean around score_array. The score_array here is the array from your example after taking the mean over axis=-1. Long story short, there is a final mean over all samples in the batch.
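
To make the two-step reduction concrete, here is a minimal NumPy sketch (not the actual Keras code, just the same arithmetic for the unmasked, unweighted case): the loss function reduces over the last axis to give one value per sample, and the wrapper then averages those per-sample values over the batch.

import numpy as np

def mean_squared_error(y_true, y_pred):
    # Per-sample loss: mean over the last axis (the output dimensions).
    return np.mean(np.square(y_pred - y_true), axis=-1)

y_true = np.zeros((10, 1))   # batch of 10 samples, 1 output each
y_pred = np.ones((10, 1))

per_sample = mean_squared_error(y_true, y_pred)
print(per_sample.shape)      # (10,) -- one loss value per sample
print(np.mean(per_sample))   # 1.0   -- final mean over the batch, as in `weighted`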

All 8 comments

This is not weird.
This is the definition of MSE.
We calculate the mean of the squared error over each output dimension.
https://en.wikipedia.org/wiki/Mean_squared_error
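
As a quick NumPy illustration of that definition (just for intuition, not Keras code): with more than one output per sample, axis=-1 averages the squared errors over the outputs of each individual sample.

import numpy as np

y_true = np.array([[0.0, 0.0, 0.0],
                   [1.0, 1.0, 1.0]])   # batch of 2 samples, 3 outputs each
y_pred = np.array([[1.0, 2.0, 3.0],
                   [1.0, 1.0, 1.0]])

# Mean over the last axis: one MSE value per sample.
print(np.mean(np.square(y_pred - y_true), axis=-1))   # [4.66666667 0.        ]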

Well, ok. It's not weird. But it's not clear to me either.

The definition says:
A vector of "n" predictions.

So the sum of the n predictions divided by n, which is the average.
So far so good.

If I have an array with, say, 10 predictions, which would be the batch size I guess,
my vector would have a shape of (10, 1). After subtracting the observed values
and squaring, the shape would still be (10, 1), and now I would expect the mean
to be taken across axis=0, which is n.

>>> import numpy as np
>>> a = np.ones((10, 1))
>>> a
array([[ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.]])
>>> np.mean(a, axis=-1)
array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])

I'm not saying that there is a bug in Keras, but I don't understand why it is done this way. I need to know though, because I want to build my own cost function.
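
For what it's worth, a custom cost function follows the same convention: return one value per sample (reduce over the last axis) and let Keras take the final mean over the batch. A rough sketch with a hypothetical loss (the extra weighting term is made up purely for illustration):

from keras import backend as K

def my_custom_mse(y_true, y_pred):
    # Hypothetical variant of MSE that puts extra weight on large errors.
    sq_err = K.square(y_pred - y_true)
    # Return one loss value per sample; Keras averages these over the batch.
    return K.mean(sq_err + 0.1 * K.square(sq_err), axis=-1)

# model.compile(optimizer='adam', loss=my_custom_mse)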

Actually, I'd like to add my own confusion about Keras' calculation.
The wiki formula is the mean of the sum of squared errors, which is different from the mean of the squared error as @Dref360 mentioned.
I was actually expecting a single number out of MSE for the error, but it seems that Keras' formula gives us a vector.
Does anyone know where the big sigma (the sum) disappeared to?
Alternatively, can someone explain why it was dropped from Keras' formula?

The length of the vector equals the batch size. The MSE is computed over the last axis. As designed.

I don't know if this helps, but I found this thread while searching for information on the loss function.

Am I right to believe that the loss function returns a representation of the calculation to be performed, and that that representation is compiled and executed? That is, the function itself is not called each time the loss is calculated?

For the problem I'm working on, I have noticed that (as far as I can tell) I can create an exact copy of the loss function, replacing K. with numpy., and then call the copy with concrete data.

If that is right, it should give Nimi42 a way to develop their loss function so that it can be tested outside of Keras itself.
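
A rough sketch of that workflow (an assumption about how one might test, not an official Keras recipe): write a NumPy twin of the loss with the same reduction over the last axis and call it with concrete arrays.

import numpy as np

def my_loss_np(y_true, y_pred):
    # NumPy twin of the backend loss: same formula, callable with plain arrays.
    return np.mean(np.square(y_pred - y_true), axis=-1)

# Concrete test data -- shapes mimic (batch_size, n_outputs).
y_true = np.array([[0.0, 1.0], [1.0, 1.0]])
y_pred = np.array([[0.0, 0.0], [2.0, 1.0]])
print(my_loss_np(y_true, y_pred))   # [0.5 0.5] -- one value per sample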

@Nimi42
I checked numpy.mean, in particular with axis=-1. For a 2D matrix, the last axis is reduced away, so the dimensionality changes.

Or just try TensorFlow's eager execution; it makes TensorFlow define-by-run (like Chainer and PyTorch).
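
A hedged sketch of that suggestion (assuming TF 1.x where tf.enable_eager_execution() is available; in TF 2.x eager execution is on by default): with eager mode you can call the loss on concrete tensors and print both the per-sample values and the final scalar.

import tensorflow as tf
tf.enable_eager_execution()   # TF 1.x API; not needed in TF 2.x

from tensorflow.keras.losses import mean_squared_error

y_true = tf.constant([[1.0], [2.0], [3.0]])
y_pred = tf.constant([[1.5], [2.0], [2.0]])

per_sample = mean_squared_error(y_true, y_pred)   # mean over the last axis
print(per_sample.numpy())                         # [0.25 0.   1.  ]
print(tf.reduce_mean(per_sample).numpy())         # 0.41666666 -- the final scalar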

Will close since the main question has been answered.

