Keras: NaNs when training ReLUs on an input with all zeros

Created on 14 Oct 2015 · 14 comments · Source: keras-team/keras

Hello!

I'm running into an issue where training a simple model on examples that contain all zeroes returns NaN for the weights and the loss. Here is an example:

import numpy as np 

from keras.models import Sequential
from keras.layers.core import Activation
from keras.layers.convolutional import Convolution1D

np.random.seed(0)

X2 = np.zeros([1, 1, 1])
Y2 = np.ones([1, 1, 1])

model = Sequential()
model.add(Convolution1D(1, 1, input_dim=1, border_mode='valid'))
model.add(Activation('relu'))

model.compile(optimizer='adagrad', loss='MSE')

hist = model.fit(X2, Y2, nb_epoch=2)

Running the above code gives:

Epoch 1/2
1/1 [==============================] - 0s - loss: 1.0000
Epoch 2/2
1/1 [==============================] - 0s - loss: nan

This happens even when the number of examples is larger (in the above code, it's 1) – just one example that's all zeros is sufficient for the NaN to occur. Changing the zero to any other number, even 0.00001, removes the problem. The problem also goes away when you remove the ReLU layer.
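
For what it's worth, a minimal sketch of the workaround described above (the epsilon value is arbitrary; anything non-zero works in my tests):

import numpy as np

# Hypothetical workaround: nudge the exact zeros to a tiny non-zero value so
# the ReLU is never evaluated exactly at 0. Even 0.00001 avoids the NaN here.
eps = 1e-5
X2 = np.zeros([1, 1, 1]) + eps
Y2 = np.ones([1, 1, 1])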

I don't get this problem when running on the PyPI Theano release; it only occurs when I pull the latest Theano build from their GitHub repo. However, going back to the older Theano build isn't an option because of the concat bug there.

Does anyone know what's going on? Thanks in advance!

stale

All 14 comments

Pangwei, I think I've traced it to weighted_objective in models.py (the weighting is, I believe, used to apply class weights); if you strip away the weighting, then the NaNs go away when you compute the gradients:

import numpy as np
import theano
import theano.tensor as T

from keras.models import Sequential
from keras.layers.core import Activation
from keras.layers.convolutional import Convolution1D
from keras import objectives
from keras.models import weighted_objective

np.random.seed(0)

X2 = np.zeros([1, 1, 1])
Y2 = np.ones([1, 1, 1])

model = Sequential()
model.add(Convolution1D(nb_filter=1, filter_length=1, input_dim=1, border_mode='valid'))
model.add(Activation('relu'))
model.compile(optimizer='sgd', loss='MSE')

train_loss_weighted = weighted_objective(objectives.get("MSE"))(model.y, model.y_train, model.weights, None)
train_loss_unweighted = objectives.get("MSE")(model.y, model.y_train).mean()
thegrad_weighted = T.grad(train_loss_weighted, model.params)
thegrad_unweighted = T.grad(train_loss_unweighted, model.params)
train_ins = [model.X_train, model.y, model.weights]

f_grad_weighted = theano.function([model.X_train, model.y, model.weights], thegrad_weighted)
print("weighted", f_grad_weighted(X2, Y2, np.ones(Y2.shape[:-1] + (1,))))

f_grad_unweighted = theano.function([model.X_train, model.y], thegrad_unweighted)
print("unweighted", f_grad_unweighted(X2, Y2))

Running the above gives:

('weighted', [array([[[[ nan]]]]), array([ nan])])
('unweighted', [array([[[[ 0.]]]]), array([-1.])])

Here are the contents of weighted_objective for convenience; it looks promising as a suspect, since there is a division by filtered_weights.sum():

def weighted_objective(fn):
    def weighted(y_true, y_pred, weights, mask=None):
        # it's important that 0 * Inf == 0, not NaN, so we need to filter
        # those out first
        filtered_y_true = y_true[weights.nonzero()[:-1]]
        filtered_y_pred = y_pred[weights.nonzero()[:-1]]
        filtered_weights = weights[weights.nonzero()]
        obj_output = fn(filtered_y_true, filtered_y_pred)
        weighted = filtered_weights * obj_output
        if mask is None:
            # Instead of calling mean() here, we divide by the sum of filtered_weights.
            return weighted.sum() / filtered_weights.sum()
        else:
            filtered_mask = mask[weights.nonzero()[:-1]]
            return weighted.sum() / (filtered_mask * filtered_weights).sum()
    return weighted
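
To make that suspicion concrete, here is a tiny NumPy-only sketch of how that final division can produce NaN once every sample weight is zero and gets filtered out. (This is not the actual Keras code path, and not necessarily what happens in the reproduction above, where the weights passed in are all ones.)

import numpy as np

# If every weight is zero, the nonzero() filtering above leaves empty arrays,
# so both the numerator and the denominator of the final division are 0.0.
filtered_weights = np.array([])               # nothing survives the filtering
weighted = filtered_weights * np.array([])    # empty as well
print(weighted.sum() / filtered_weights.sum())  # 0.0 / 0.0 -> nan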

I don't see a problem in Theano here, but it isn't clear. If there is one, tell me. As you know, a division by 0 would generate a NaN, and that could be the case. I don't understand enough to tell why the NaN exists only in the dev version and not in the last release.

Yeah, I don't understand enough either. Can you replicate the error above? Nothing in the code obviously suggests that a division by 0 should exist when you're training a simple ReLU model.

NaN can appear in multiple circumstances, not just from dividing 0 by 0. Usually the cause is a mathematically unstable computation. I had NaN problems with PReLU, but did not pinpoint the exact problem.

From Wikipedia on NaN (https://en.wikipedia.org/wiki/NaN#Operations_generating_NaN):

  • Operations with a NaN as at least one operand.
  • Indeterminate forms:
    • The divisions 0/0 and ±∞/±∞.
    • The multiplications 0 × ±∞ and ±∞ × 0.
    • The additions ∞ + (−∞), (−∞) + ∞, and equivalent subtractions.
    • The standard has alternative functions for powers:
      • The standard pow function and the integer-exponent pown function define 0^0, 1^∞, and ∞^0 as 1.
      • The powr function defines all three indeterminate forms as invalid operations and so returns NaN.
  • Real operations with complex results, for example:
    • The square root of a negative number.
    • The logarithm of a negative number.
    • The inverse sine or cosine of a number that is less than −1 or greater than +1.

From Theano mailing list on NaN (https://groups.google.com/forum/#!topic/theano-users/UTn3hepy1sw):

  • if the error starts increasing and NaN appears afterwards: the training is diverging due to too high a learning rate
  • if NaNs appear suddenly: saturating units yielding a non-differentiable gradient
  • NaN computation due to log(0) (for example if cross-entropy is used)
  • NaN due to floating-point issues (too high weights) or activations on the output (could also happen with MSE)
  • 0/0, inf/inf, inf*weight...
  • solutions: weight clipping, L2 norm, lower learning rate, adding a small value to log(x) (see the sketch below), different weight initialization (glorot -> gaussian)
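
As a concrete illustration of the log(x) item above, here is a sketch of a hypothetical custom loss (not code from this issue; the epsilon value is arbitrary) that clips the prediction away from 0 and 1 so the log never sees 0:

import theano.tensor as T

def stable_binary_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip predictions into (eps, 1 - eps) so that log() stays finite.
    y_pred = T.clip(y_pred, eps, 1.0 - eps)
    return -(y_true * T.log(y_pred) + (1.0 - y_true) * T.log(1.0 - y_pred)).mean()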

A remark to anyone looking at this. (I'm trying to understand a possibly similar problem, and I got sidetracked onto this one because it's simpler than mine.)
It appears that using DebugMode in Theano causes ReLU to have a NaN gradient at 0.

Specifically, using the current (dev) Theano on Linux, the following prints nan in DebugMode and 0.5 otherwise:

import theano
import theano.tensor

a = theano.tensor.fscalar("a")
b = theano.tensor.nnet.relu(a)
c = theano.grad(b, a)
f = theano.function([a], [c])
print(f(0.0))

I don't know how reliably relu has a proper (sub)gradient outside debug mode, so I don't know if this issue is related to that question.

I am able to reproduce @bottler's error. The problem occurs when using optimizer=fast_compile but not when using optimizer=fast_run.
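
For anyone who wants to check this themselves, here is a small sketch that compiles the same ReLU gradient under both predefined modes (assuming, as a later comment suggests, that mode=FAST_COMPILE corresponds to optimizer=fast_compile and mode=FAST_RUN to optimizer=fast_run):

import numpy as np
import theano
import theano.tensor as T

a = T.fscalar("a")
g = theano.grad(T.nnet.relu(a), a)

# Same graph, compiled once per predefined mode.
f_fast_compile = theano.function([a], g, mode='FAST_COMPILE')
f_fast_run = theano.function([a], g, mode='FAST_RUN')

print(f_fast_compile(np.float32(0.0)))  # reportedly nan
print(f_fast_run(np.float32(0.0)))      # reportedly 0.5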

I also encountered the same bug. Is it a kind of "feature" of Theano? It would probably be better to report it in the Theano forums/issues.

Yes, this appears to be a Theano issue.

Looks like it was already reported.

I think this case is different. The current case is due to fast_compile disabling the stability optimizations, which are needed here to make it work. To get fast_compile plus just the stability optimizations, use optimizer=stabilize.
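
If it helps, the usual ways to set that flag (the stabilize value is taken verbatim from the comment above; the script name is just a placeholder) are the THEANO_FLAGS environment variable or theano.config, before anything is compiled:

# Via the environment, before starting Python:
#   THEANO_FLAGS='optimizer=stabilize' python train_script.py
# Or in code, before building and compiling the model:
import theano
theano.config.optimizer = 'stabilize'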

I made an issue about this, as it gets reported frequently enough to rethink that: #4442

If you update Theano to the dev version, it should work with optimizer=fast_compile or mode=FAST_COMPILE:

http://www.deeplearning.net/software/theano/install.html#bleeding-edge-install-instructions

@gw0 I'm hitting the PReLU NaN error. Could you share your solution for this case?

@Lzc6996 Unfortunately, I did not solve it. Restarting the training a couple of times, changing the weight initialization, and lowering the learning rate helped.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs, but feel free to re-open it if needed.
