Keras: NaNs when training ReLUs on an input with all zeros

Created on 14 Oct 2015 · 14 comments · Source: keras-team/keras

Hello!

I'm running into an issue where training a simple model on examples that contain all zeroes returns NaN for the weights and the loss. Here is an example:

import numpy as np 

from keras.models import Sequential
from keras.layers.core import Activation
from keras.layers.convolutional import Convolution1D

np.random.seed(0)

X2 = np.zeros([1, 1, 1])
Y2 = np.ones([1, 1, 1])

model = Sequential()
model.add(Convolution1D(1, 1, input_dim=1, border_mode='valid'))
model.add(Activation('relu'))

model.compile(optimizer='adagrad', loss='MSE')

hist = model.fit(X2, Y2, nb_epoch=2)

Running the above code gives:

Epoch 1/2
1/1 [==============================] - 0s - loss: 1.0000
Epoch 2/2
1/1 [==============================] - 0s - loss: nan

This happens even when the number of examples is larger (in the above code, it's 1) – just one example that's all zeros is sufficient for the NaN to occur. Changing the zero to any other number, even 0.00001, removes the problem. The problem also goes away when you remove the ReLU layer.
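
For what it's worth, a minimal sketch of the workaround described above (the epsilon value is arbitrary; anything non-zero works in my tests):

import numpy as np

# Hypothetical workaround: nudge the exact zeros to a tiny non-zero value so
# the ReLU is never evaluated exactly at 0. Even 0.00001 avoids the NaN here.
eps = 1e-5
X2 = np.zeros([1, 1, 1]) + eps
Y2 = np.ones([1, 1, 1])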

I don't get this problem when running on the PyPI Theano release; it only occurs when I pull the latest Theano build from their GitHub repo. However, going back to the older Theano build isn't an option because of the concat bug there.

Does anyone know what's going on? Thanks in advance!

stale

All 14 comments

Pangwei, I think I've traced it to weighted_objective in models.py (the weighting is, I believe, used to apply class weights); if you strip away the weighting, then the NaNs go away when you compute the gradients:

import numpy as np
import theano
import theano.tensor as T

from keras.models import Sequential
from keras.layers.core import Activation
from keras.layers.convolutional import Convolution1D
from keras import objectives
from keras.models import weighted_objective

np.random.seed(0)

X2 = np.zeros([1, 1, 1])
Y2 = np.ones([1, 1, 1])

model = Sequential()
model.add(Convolution1D(nb_filter=1, filter_length=1, input_dim=1, border_mode='valid'))
model.add(Activation('relu'))
model.compile(optimizer='sgd', loss='MSE')

train_loss_weighted = weighted_objective(objectives.get("MSE"))(model.y, model.y_train, model.weights, None)
train_loss_unweighted = objectives.get("MSE")(model.y, model.y_train).mean()
thegrad_weighted = T.grad(train_loss_weighted, model.params)
thegrad_unweighted = T.grad(train_loss_unweighted, model.params)
train_ins = [model.X_train, model.y, model.weights]

f_grad_weighted = theano.function([model.X_train, model.y, model.weights], thegrad_weighted)
print("weighted", f_grad_weighted(X2, Y2, np.ones(Y2.shape[:-1] + (1,))))

f_grad_unweighted = theano.function([model.X_train, model.y], thegrad_unweighted)
print("unweighted", f_grad_unweighted(X2, Y2))

Running the above gives:

('weighted', [array([[[[ nan]]]]), array([ nan])])
('unweighted', [array([[[[ 0.]]]]), array([-1.])])

Here are the contents of weighted_objective for convenience; it looks promising as a suspect, since there is a division by filtered_weights.sum():

def weighted_objective(fn):
    def weighted(y_true, y_pred, weights, mask=None):
        # it's important that 0 * Inf == 0, not NaN, so we need to filter
        # those out first
        filtered_y_true = y_true[weights.nonzero()[:-1]]
        filtered_y_pred = y_pred[weights.nonzero()[:-1]]
        filtered_weights = weights[weights.nonzero()]
        obj_output = fn(filtered_y_true, filtered_y_pred)
        weighted = filtered_weights * obj_output
        if mask is None:
            # Instead of calling mean() here, we divide by the sum of filtered_weights.
            return weighted.sum() / filtered_weights.sum()
        else:
            filtered_mask = mask[weights.nonzero()[:-1]]
            return weighted.sum() / (filtered_mask * filtered_weights).sum()
    return weighted
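
To make that suspicion concrete, here is a tiny NumPy-only sketch of how that final division can produce NaN once every sample weight is zero and gets filtered out. (This is not the actual Keras code path, and not necessarily what happens in the reproduction above, where the weights passed in are all ones.)

import numpy as np

# If every weight is zero, the nonzero() filtering above leaves empty arrays,
# so both the numerator and the denominator of the final division are 0.0.
filtered_weights = np.array([])               # nothing survives the filtering
weighted = filtered_weights * np.array([])    # empty as well
print(weighted.sum() / filtered_weights.sum())  # 0.0 / 0.0 -> nan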

I don't see a problem in Theano here, but it isn't clear. If there is one, tell me. As you know, a division by 0 would generate a NaN, and that could be the case. I don't understand enough to tell why the NaN exists only in the dev version and not in the last release.

Yeah, I don't understand enough either. Can you replicate the error above? Nothing in the code obviously suggests that a division by 0 should exist when you're training a simple ReLU model.

NaN can appear in multiple circumstances, not just from dividing 0 by 0. Usually the cause is a mathematically unstable computation. I had NaN problems with PReLU, but did not pinpoint the exact problem.

From Wikipedia on NaN (https://en.wikipedia.org/wiki/NaN#Operations_generating_NaN):

  • Operations with a NaN as at least one operand.
  • Indeterminate forms:
    • The divisions 0/0 and ±∞/±∞.
    • The multiplications 0 × ±∞ and ±∞ × 0.
    • The additions ∞ + (−∞), (−∞) + ∞, and equivalent subtractions.
    • The standard has alternative functions for powers:
      • The standard pow function and the integer-exponent pown function define 0^0, 1^∞, and ∞^0 as 1.
      • The powr function defines all three indeterminate forms as invalid operations and so returns NaN.
  • Real operations with complex results, for example:
    • The square root of a negative number.
    • The logarithm of a negative number.
    • The inverse sine or cosine of a number that is less than −1 or greater than +1.

From Theano mailing list on NaN (https://groups.google.com/forum/#!topic/theano-users/UTn3hepy1sw):

  • if the error starts increasing and NaN appears afterwards: the training is diverging due to too high a learning rate
  • if NaNs appear suddenly: saturating units yielding a non-differentiable gradient
  • NaN computation due to log(0) (for example if cross-entropy is used)
  • NaN due to floating-point issues (too high weights) or activations on the output (could also happen with MSE)
  • 0/0, inf/inf, inf*weight...
  • solutions: weight clipping, L2 norm, lower learning rate, adding a small value to log(x) (see the sketch below), different weight initialization (glorot -> gaussian)
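
As a concrete illustration of the log(x) item above, here is a sketch of a hypothetical custom loss (not code from this issue; the epsilon value is arbitrary) that clips the prediction away from 0 and 1 so the log never sees 0:

import theano.tensor as T

def stable_binary_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip predictions into (eps, 1 - eps) so that log() stays finite.
    y_pred = T.clip(y_pred, eps, 1.0 - eps)
    return -(y_true * T.log(y_pred) + (1.0 - y_true) * T.log(1.0 - y_pred)).mean()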

A remark to anyone looking at this. (I'm trying to understand a possibly similar problem, and I got sidetracked onto this one because it's simpler than mine.)
It appears that using DebugMode in Theano causes ReLU to have a NaN gradient at 0.

Specifically, using the current (dev) Theano on Linux, the following prints nan in DebugMode and 0.5 otherwise:

import theano
import theano.tensor

a = theano.tensor.fscalar("a")
b = theano.tensor.nnet.relu(a)
c = theano.grad(b, a)
f = theano.function([a], [c])
print(f(0.0))

I don't know how reliably relu has a proper (sub)gradient outside debug mode, so I don't know if this issue is related to that question.

I am able to reproduce @bottler's error. The problem occurs when using optimizer=fast_compile but not when using optimizer=fast_run.
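
For anyone who wants to check this themselves, here is a small sketch that compiles the same ReLU gradient under both predefined modes (assuming, as a later comment suggests, that mode=FAST_COMPILE corresponds to optimizer=fast_compile and mode=FAST_RUN to optimizer=fast_run):

import numpy as np
import theano
import theano.tensor as T

a = T.fscalar("a")
g = theano.grad(T.nnet.relu(a), a)

# Same graph, compiled once per predefined mode.
f_fast_compile = theano.function([a], g, mode='FAST_COMPILE')
f_fast_run = theano.function([a], g, mode='FAST_RUN')

print(f_fast_compile(np.float32(0.0)))  # reportedly nan
print(f_fast_run(np.float32(0.0)))      # reportedly 0.5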

I also encountered the same bug. Is it a kind of "feature" of Theano? It would probably be better to report it in the Theano forums/issues.

Yes, this appears to be a Theano issue.

Looks like it was already reported.

I think this case is different. The current case is due to fast_compile disabling the stability optimizations, which are needed here to make it work. To get fast_compile plus just the stability optimizations, use optimizer=stabilize.
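
If it helps, the usual ways to set that flag (the stabilize value is taken verbatim from the comment above; the script name is just a placeholder) are the THEANO_FLAGS environment variable or theano.config, before anything is compiled:

# Via the environment, before starting Python:
#   THEANO_FLAGS='optimizer=stabilize' python train_script.py
# Or in code, before building and compiling the model:
import theano
theano.config.optimizer = 'stabilize'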

I made an issue about this, as it gets reported frequently enough to rethink that: #4442

If you update Theano to the dev version, it should work with optimizer=fast_compile or mode=FAST_COMPILE:

http://www.deeplearning.net/software/theano/install.html#bleeding-edge-install-instructions

@gw0 I'm hitting the PReLU NaN error. Could you share your solution for this case?

@Lzc6996 Unfortunately, I did not solve it. Restarting the training a couple of times, changing the weight initialization, and lowering the learning rate helped.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs, but feel free to re-open it if needed.
