Keras: Loss turns into 'nan' when running on GPU

Created on 11 Dec 2015 · 89 Comments · Source: keras-team/keras

Like previously stated in issue #511, Keras runs into not-a-number losses while training on GPU.
I tested this with the mnist_cnn example code as well as with self-designed conv networks. I also tried disabling cuDNN, as well as increasing the epsilon and setting a clipnorm. Nothing solved the problem.

I'm using the latest versions of Theano and Keras, with SGD optimisation and categorical crossentropy.

Graphics: GTX 980 Ti

Most helpful comment

As far as I know, it's the combination of relu and softmax that causes numerical troubles, as relu can produce large positive values corresponding to very small probabilities. If you change your model to use, say, tanh instead of relu for the last dense layer, the problem will go away.

All 89 comments

I'd like to identify what op is causing this issue.

  • post your code here.
  • try to use a different loss than categorical crossentropy, e.g. MSE

Here is the net-part of my code. I'll try other loss functions, but they take some time to provide useful evidence, since you can't determine when the loss turns into 'nan'.

img_rows = img_cols = 128
img_channels = 3
l1 = l2 = 0

# convert data for GPU use
X_train = X_train.astype("float32")
X_test = X_test.astype("float32")
X_train /= 255
X_test /= 255

# convert class vectors to binary class matrices
y_train = np_utils.to_categorical(y_train, nb_classes)
y_test = np_utils.to_categorical(y_test, nb_classes)

model = Sequential()

model.add(Convolution2D(16, 5, 5, border_mode='same',
                        input_shape=(img_channels, img_rows, img_cols), W_regularizer=l1l2(l1, l2)))
model.add(Activation('relu'))
model.add(Convolution2D(16, 3, 3, W_regularizer=l1l2(l1, l2)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.3))

model.add(Convolution2D(32, 3, 3, border_mode='valid', W_regularizer=l1l2(l1, l2)))
model.add(Activation('relu'))
model.add(Convolution2D(32, 3, 3, W_regularizer=l1l2(l1, l2)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.3))

model.add(Convolution2D(64, 3, 3, border_mode='valid', W_regularizer=l1l2(l1, l2)))
model.add(Activation('relu'))
model.add(Convolution2D(64, 3, 3, W_regularizer=l1l2(l1, l2)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.3))

model.add(Flatten())
model.add(Dense(1024, W_regularizer=l1l2(l1, l2)))
model.add(Activation('relu'))
model.add(Dropout(0.6))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))

sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd)

model.fit(X_train, y_train, batch_size=batch_size, nb_epoch=nb_epoch,shuffle=True, show_accuracy=True, callbacks=[history])

As far as I know, it's the combination of relu and softmax that causes numerical troubles, as relu can produce large positive values corresponding to very small probabilities. If you change your model to use, say, tanh instead of relu for the last dense layer, the problem will go away.
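(A minimal, hypothetical sketch of that suggestion, not the exact model from this thread; the layer sizes are made up:)

from keras.models import Sequential
from keras.layers import Dense, Activation

nb_classes = 10  # hypothetical number of classes

model = Sequential()
model.add(Dense(1024, input_dim=4096))
model.add(Activation('tanh'))    # was 'relu'; tanh keeps the pre-softmax values bounded in [-1, 1]
model.add(Dense(nb_classes))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='sgd')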

I first tested the 'tanh' activation, which didn't help. It was no surprise though, since this problem is specific to calculations on the GPU and not a general one with numerical stability.

I also tried mse as the loss function, which ran into 'nan' as well.

I also tried mse as the loss function, which ran into 'nan' as well.

In that case the overflow is happening earlier in the graph.

Next, you could try removing the regularizers.

BTW the history callback is included by default, no need to specify it manually.

Correct me if I'm wrong, but with l1 = l2 = 0 it should not matter that the l1l2 regularizer is defined?

But I'll try to remove them.

Of course, it _should_ not matter. Also, there should not be a float32 overflow.

Okay, I removed all W regularizers and the 'nan' loss still occurs.

I noticed that the net loss is more likely to output 'nan' when using deeper (e.g. 8 conv layers instead of 6) and wider (e.g. 512 feature maps) networks.

I am also having problem with nans on loss.

I realized that the weights became nan, but I don't know if this changed before or after the loss calculation. Which is very strange, since the values used to calculate the crossentropy are clipped before applying the objective function...

The same thing happened to me, except I was using Keras to build a regression model. I have tried different losses (rmse or mae) and also sigmoid and tanh apart from relu. Nothing helps to improve this case.

I think I've fixed it on the PR #1368.
@fchollet what do you think?

I get your point on preventing division by zero, but this doesn't explain why this problem is specific to certain GPUs, especially the GTX 9xx series (I never had a problem on my GTX 670).

I'm getting the same problem. I think this is a problem with the system configuration more so than with the code. My code used to work, but then I had to reformat my computer and reinstall everything, and now I'm getting "nan" loss. So I think it's something with the configuration of Theano, CUDA, Visual Studio, or CuDNN, at least in my case. Still trying to figure it out.

I'm also getting this problem (Ubuntu 14.04, GTX 980Ti/970, Theano as backend, CNN with residual units, ReLU, BN, mse/mae loss).

In my case the problem occurred randomly; the probability of getting nan increases with the model's complexity (and memory usage). When the loss becomes nan, loading saved weights doesn't help to continue training (the weights become corrupted on the first training iteration). Only recompiling or creating a new model allows training to continue.

It works for me now. I had installed cuDNN incorrectly - previously I had just dragged the cuDNN files and dropped them in the CUDA folder, replacing anything with the same name. So I re-installed Visual Studio (2013), Anaconda, Theano, and Keras. It still gave me "nan". So then, I installed cuDNN, but this time, I did this by extracting the cuDNN files to their own directory, and then just added that directory to my path. I think that was the key factor for me: installing cuDNN (properly). I was using relu and adam the whole time.

The same problem on Tesla M2090. I tried consume_less gpu and cpu. GRU is working ok.

Anybody any progress on this issue?

I had this problem - nets that worked perfectly fine on various CPU hardware failed to train on AWS GPU-enabled remote machine.

I removed Theano 0.8.0, and upgraded to the bleeding-edge version from GitHub (which is 0.9.0-dev2). Now training works perfectly.

Can't blame this one on Keras, folks!

On CPU I am getting nans too, but after more epochs than on GPU.

I have the same problem. I train an LSTM network with my own data. The train_loss becomes NaN suddenly. I checked my code with the imdb dataset and it works OK, but when I switch to my dataset the nan problem occurs. I preprocessed my data in the same way that the imdb dataset is preprocessed in the imdb_lstm example of Keras. I do not understand what the problem is. It seems that the network configuration is OK since it runs with another dataset. However, my dataset and the imdb dataset are both text; how can another text dataset cause this issue? I tried gradient clipping and also weight norm limitations. I think the sudden change happens when an inf value is produced in the categorical crossentropy function, such as log(0). But how can I detect and avoid this problem?

I also had this problem. I fixed it when I changed the Y values to float numbers. For example 0.0 1.0 instead of 0 and 1.

like @9thDimension said, upgrading Theano to the bleeding-edge version (0.9.0-dev2) seems to have fixed the nan issues for me so far on debian wheezy. i'm using a python 3.5.2 env in anaconda 4.1.1.

i just followed the instructions from the theano website here:
pip install --upgrade --no-deps git+git://github.com/Theano/Theano.git

I'm training on the CPU and using tensorflow as backend, also getting the nan issue.

Was having the same issue with a regression task. Upgrading Theano didn't work but changing the optimizer from 'sgd' to 'rmsprop' seemed to help.

@skerit did you figure it out?

I had the nan problem as well and I solved it by changing the floatx value in ~/.keras/keras.json from float32 to float64. (tested on GPU)

This is the description of my setup:
Backend: tensorflow and theano
Optimizer: Adam
GPU: Titan X and GTX 970
Activations: RELU
Last layer activation: sigmoid
Objective: binary cross entropy

If more details are needed, let me know.

EDIT: the problem was not solved by this, but the training lasted longer, so an acceptable loss value was reached
EDIT2: after re-reading the images and saving them, the training lasted even longer
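(For reference, a minimal sketch of making the same change programmatically instead of editing keras.json; this assumes a Keras version that exposes backend.set_floatx:)

from keras import backend as K

K.set_floatx('float64')   # equivalent to setting "floatx": "float64" in ~/.keras/keras.json
print(K.floatx())         # -> 'float64'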

Did you time it? If you use float64 with the current GPU backend, this causes all computation to be on the CPU.


I was having the same issue. Disabled cuDNN (optimizer_exclude=cudnn); everything works fine. And slow.

I too ran into a similar issue where the loss and layer weights would suddenly be set to nan during training with floatx as float32 (it worked fine with float64 but was much slower).

I was able to fix this by applying either the clipnorm or clipvalue optimizer attributes (https://keras.io/optimizers/#parameters-common-to-all-keras-optimizers). It seems that for me this was a case of exploding gradients, which may not be true for all cases reported here. I just thought I'd mention what worked for me in case that's helpful to others.
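(A minimal sketch of those two optimizer attributes; the clipping values here are illustrative, not tuned:)

from keras.optimizers import SGD, Adam

# Clip the L2 norm of the whole gradient vector to at most 1.0 ...
sgd = SGD(lr=0.01, momentum=0.9, clipnorm=1.0)

# ... or clip every individual gradient value to the range [-0.5, 0.5].
adam = Adam(lr=0.001, clipvalue=0.5)

# model.compile(loss='categorical_crossentropy', optimizer=sgd)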

I used clipnorm, too, and it allowed me to use the adam optimizer. I wonder if using clipnorm might have a negative impact on the accuracy.

I know that clipnorm fixes this issue, and I know that clipnorm clips large gradients, but I want to know why the nan is produced in the first place.
Why do I see loss=nan when I don't clip the gradients?

I have the same issue during training of a 3D convolutional network on GPU. I use float32 and the Theano backend.

I have the following in ~/.keras/keras.json:

{
    "floatx": "float32",
    "epsilon": 1e-07,
    "image_dim_ordering": "tf",
    "backend": "tensorflow"
}

and I got nan:

mona@pascal:~/computer_vision/VPilot$ python train.py 
Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so.5.0 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so.8.0 locally
/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py:1938: UserWarning: Expected no kwargs, you passed 1
kwargs passed to function are ignored with Tensorflow backend
  warnings.warn('\n'.join(msg))
Epoch 1/1000
I tensorflow/core/common_runtime/gpu/gpu_device.cc:951] Found device 0 with properties: 
name: Tesla K40c
major: 3 minor: 5 memoryClockRate (GHz) 0.8755
pciBusID 0000:03:00.0
Total memory: 11.92GiB
Free memory: 11.85GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:572] creating context when one is currently active; existing: 0x4750d80
I tensorflow/core/common_runtime/gpu/gpu_device.cc:951] Found device 1 with properties: 
name: Tesla K40c
major: 3 minor: 5 memoryClockRate (GHz) 0.8755
pciBusID 0000:83:00.0
Total memory: 11.92GiB
Free memory: 11.85GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:855] cannot enable peer access from device ordinal 0 to device ordinal 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:855] cannot enable peer access from device ordinal 1 to device ordinal 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] DMA: 0 1 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 0:   Y N 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 1:   N Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K40c, pci bus id: 0000:03:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K40c, pci bus id: 0000:83:00.0)
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:245] PoolAllocator: After 4777 get requests, put_count=3270 evicted_count=1000 eviction_rate=0.30581 and unsatisfied allocation rate=0.54574
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:257] Raising pool_size_limit_ from 100 to 110
    4/70629 [..............................] - ETA: 364851s - loss: 0.5890I tensorflow/core/common_runtime/gpu/pool_allocator.cc:245] PoolAllocator: After 755 get requests, put_count=1771 evicted_count=1000 eviction_rate=0.564653 and unsatisfied allocation rate=0
    8/70629 [..............................] - ETA: 194931s - loss: 0.5553I tensorflow/core/common_runtime/gpu/pool_allocator.cc:245] PoolAllocator: After 247 get requests, put_count=1270 evicted_count=1000 eviction_rate=0.787402 and unsatisfied allocation rate=0
   13/70629 [..............................] - ETA: 129454s - loss: 0.5582I tensorflow/core/common_runtime/gpu/pool_allocator.cc:245] PoolAllocator: After 5071 get requests, put_count=4961 evicted_count=2000 eviction_rate=0.403145 and unsatisfied allocation rate=0.423979
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:257] Raising pool_size_limit_ from 449 to 493
   18/70629 [..............................] - ETA: 100341s - loss: 0.5194I tensorflow/core/common_runtime/gpu/pool_allocator.cc:245] PoolAllocator: After 5145 get requests, put_count=5327 evicted_count=2000 eviction_rate=0.375446 and unsatisfied allocation rate=0.365986
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:257] Raising pool_size_limit_ from 720 to 792
   25/70629 [..............................] - ETA: 79355s - loss: 0.5875I tensorflow/core/common_runtime/gpu/pool_allocator.cc:245] PoolAllocator: After 5137 get requests, put_count=5388 evicted_count=1000 eviction_rate=0.185598 and unsatisfied allocation rate=0.175784
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:257] Raising pool_size_limit_ from 1694 to 1863
70629/70629 [==============================] - 25358s - loss: nan - val_loss: nan
Epoch 2/1000
70629/70629 [==============================] - 24899s - loss: nan - val_loss: nan
Epoch 3/1000
70629/70629 [==============================] - 24967s - loss: nan - val_loss: nan
Epoch 4/1000
70629/70629 [==============================] - 24987s - loss: nan - val_loss: nan
Epoch 5/1000
70629/70629 [==============================] - 24855s - loss: nan - val_loss: nan
Epoch 6/1000
70629/70629 [==============================] - 24977s - loss: nan - val_loss: nan


I have this problem on keras 1.2.1 and theano 0.9.0b1. My epochs are already starting with nan. Adding a clipvalue=1, changing the learning rate and trying different optimizers did not help.

I also have the same issue training LSTM network by multi_gpu.py, using mse as loss function.

I get NaN for a linear regressor at the time of model.evaluate for Adam or a tf backend FTRL optimizer. Have tried changing parameter size of the NN arch, played around with learning_rates, regularizers, clipping etc.. No luck. I am running on 3 Tesla-X GPU.

(BTW, happens only when I allocate more than 1 GPU)

I will post my experience and my solution. One of the key aspects of saturation really has nothing to do with the cost or the parameters, but with the updates. The softmax function has some tricks for preventing overflow when the previous layer is a ReLU that can output high values. I'm fairly sure Theano implements these tricks, but I have not checked it.

So normally deactivating cuDNN can solve the problem. I experienced this on a classification convolutional neural network and on an MSE fully connected neural network, with good parameter initialization and data normalization (I did not initialize the weights with values on the order of 10³, for example), and I still had saturation. First deactivate cuDNN, as it makes approximations.

Then, with the same code, I experienced saturation when changing the Theano version: in one Theano version it does not saturate and in the other it does. Moreover, depending on the GPU, I also see saturation; on a GTX 1070 I have more saturation than on a GTX 1080. Hopefully with the new Theano back-end we will have float64 support, but for the moment it does not seem to happen.

So finally the way I solved this is by scaling the cost function. Saturation sometimes happens because in an early layer the derivative with respect to a weight is a sum over the mini-batch (and the earlier you go in your network, the more summations influence it). Lots of sums can produce large values that end up making a saturated update. Since scaling is a monotonic transformation, it does not change the optimization point. Simply take your cost and scale it by, for example, 0.00001. This solved my problem.

Remark: my sum of squared errors was normalized by the batch size and by my factor. Hope this helps.
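(A minimal sketch of that loss-scaling trick; the 1e-5 factor is only an example, and scaling by a positive constant does not move the optimum:)

from keras import losses

SCALE = 1e-5  # illustrative constant

def scaled_mse(y_true, y_pred):
    # Same minimum as plain MSE, but smaller gradients and therefore smaller updates.
    return SCALE * losses.mean_squared_error(y_true, y_pred)

# model.compile(loss=scaled_mse, optimizer='sgd')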

An update on the bug, (I run this on Tesla-X GPU).

I do consistently get the error when I use sample_weights. The model has a sparse input size of about 8000 neurons and the first layer is an SQRT reduction of the size.

with tf.device('/cpu:0'):
    width = wide_array_width(wide_col_len_dict)
    reduction = wide_reduce(width)
    model = Sequential()
    model.add(Dense(reduction, input_dim=width, activation='softplus'))
    if(middle_layer):
        model.add(Dense(wide_reduce(reduction), activation='softplus', W_constraint=maxnorm(2)))
    # final_layer
    model.add(Dense(1, init='normal', activation='sigmoid'))
    model.compile(loss='binary_crossentropy',
                  optimizer=keras.optimizers.Adam(lr=0.001,
                                                  beta_1=0.9,
                                                  beta_2=0.999,
                                                  epsilon=10e-04,
                                                  decay=0.0,
                                                  clipnorm=1.0,
                                                  clipvalue=0.3))

The model trains if I comment out the sample_weight section (But trains the model horribly wrong)

    hist = model.fit(input_dense_matrix, 
                     labels, 
                     nb_epoch=train_steps, 
                     verbose=0, 
                     shuffle=True,
                     validation_split=0.2,
                     batch_size=60,
                     sample_weight=sample_weights_,
                     callbacks=[early_stopping, checkpointer])

I am having the same issue for this network:

class FeedForward:

    def __init__(self, input_dim, nb_classes):

        in_x = Input(shape=(input_dim, ), name='in_x')
        h1 = Dense(14, name='h1', activation='tanh')(in_x)
        h2 = Dense(8, name='h2', activation='tanh')(h1)
        out = Dense(nb_classes, name='out', activation='tanh')(h2)

        self.model = Model(input=[in_x], output=[out])

    def compile_model(self, optimizer='adam', loss='mse'):
        self.model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])

loss will always be nan unless I wrap everything into with tf.device('/cpu:0'): and run the calculations on the CPU.

I'm having a similar issue with my new Titan X running on TF 1.0.1 using CUDA 8.0 and CuDNN 5.1.10. I have tried clipping the gradients but I've had no luck. My model works fine on CPU but within 100 iterations of mini-batches size 10 I inevitably get NaN values when running on my GPU.

Is it possible this is a problem with my installation of CUDA, CuDNN, or TF? I've tried building TF from source to no avail. Has anyone had any luck going back to CUDA 7.5 and CuDNN 4?

EDIT: So after a lot of work I found out that this was an error with my code and not with the architecture. Apparently nans can become more prevalent depending on your environment but at its core this seems to be an issue with my model.

I am also getting the issue when I add regularization (for an attention layer). I played with kernel regularization and activity regularization and they are both resulting in nan's. I get nan on both GPU and CPU training

Also having this issue on a GeForce GTX 1060. Training on CPU works OK; on GPU the loss becomes nan after the first batch update.


Tried multiple versions of cuDNN (all of them were 5005 or more recent though), Theano (0.8.something and 0.9.0) and Keras (1.2.something and 2.something). All had the same problem.


Tried disabling cuDNN through .theanorc and scaling my loss by 0.00001. Neither solved the issue (although I was still seeing the cudnn 5005 in the 'Using gpu' message ...)

Things worth of note:
- loss becomes nan even if learning rate is set to 0.
- my model is Input-Embedding-CNN-CNN-Dense-Output. If I remove one CNN layer, training works again.


Tried running the same program on a Tesla K40c. Same story.


Tried decreasing the batch_size. So far so good, haven't seen the nan error yet. My batch size before was 250. It made the first loss calculation (0.0853) and then turned to nan at 500/80000. Now I'm using a batch size of 2 (went to the extreme). I'm currently at 1000/80000 without any problems. Will try different batch sizes and find the one that works best for me.

Again, 500 and 1000 are only the chunks of data processed; this is all within the first epoch.


Hope this might help people in the future with the same problem I had.

I kept having this issue which was annoying because I would train something overnight and in the morning it was nan. I think I fixed it now, I haven't got nans after about a day of training but I'll update my comment if I do.

To fix this, you have to do three things:

Add a very minor bias and weight regularizer to every layer

model.add(Dense(hiddenSize, kernel_regularizer=l2(0.00001), bias_regularizer=l2(0.00001)))

This is so small it won't really affect your training, it will just ensure the weights and biases don't get massive

Next I did

optimizer = optimizers.Adam(clipnorm=1., clipvalue=0.5)

As described above. Finally, I am using crossentropy loss so I changed it to this:

from keras import losses
import theano.tensor as T  # assuming the Theano backend here; with TF you would use keras.backend.clip instead

def constrainedCrossEntropy(ytrue, ypred):
    ypred = T.clip(ypred, 0.0001, 0.99999)
    return losses.categorical_crossentropy(ytrue, ypred)

model.compile(loss=constrainedCrossEntropy, optimizer=optimizer)

This ensures the values stay in a reasonable range, because if they get too close to 0 or 1 you will get nans.

Edit: I had the parameters flipped for my constrainedCrossEntropy function, fixed that now

I also had problems with the train or val loss turning to nan, until I realized that my custom loss function was not capable of handling values bigger than 88 (because exp(89) is too big for float32).

from keras import backend as K

def binary_regression_error(y_true, y_pred):
    return K.mean(K.log(1 + K.exp(K.clip(-y_true*y_pred, -1e40, 88.))))

So clipping solved it for me.

Hi guys,
I don't know what to do anymore. I have tried all the solutions given above, but I still experience loss: nan and accuracy: nan even with a very small batch size of 50. I am using a GeForce GTX 680 with cuDNN version 5105. Below is the output with the TensorFlow backend:
35/50 [====================>.........] - ETA: 13s - loss: 9.4006 - acc: 0.1041{'acc': 0.10833333, 'loss': 9.4530907, 'batch': 35, 'size': 32}
36/50 [====================>.........] - ETA: 12s - loss: 9.4020 - acc: 0.1043
i:96
{'acc': 0.110625, 'loss': 9.3898754, 'batch': 36, 'size': 32}
37/50 [=====================>........] - ETA: 11s - loss: 9.4017 - acc: 0.1044{'acc': 0.10916667, 'loss': 9.2677832, 'batch': 37, 'size': 32}
38/50 [=====================>........] - ETA: 10s - loss: 9.3982 - acc: 0.1046{'acc': 0.11254902, 'loss': 9.3335171, 'batch': 38, 'size': 17}
39/50 [======================>.......] - ETA: 9s - loss: 9.3965 - acc: 0.1048 {'acc': nan, 'loss': nan, 'batch': 39, 'size': 0}
40/50 [=======================>......] - ETA: 8s - loss: nan - acc: nan {'acc': nan, 'loss': nan, 'batch': 40, 'size': 0}
41/50 [=======================>......] - ETA: 7s - loss: nan - acc: nan{'acc': nan, 'loss': nan, 'batch': 41, 'size': 0}
42/50 [========================>.....] - ETA: 6s - loss: nan - acc: nan{'acc': nan, 'loss': nan, 'batch': 42, 'size': 0}
43/50 [========================>.....] - ETA: 5s - loss: nan - acc: nan{'acc': nan, 'loss': nan, 'batch': 43, 'size': 0}
44/50 [=========================>....] - ETA: 4s - loss: nan - acc: nan{'acc': nan, 'loss': nan, 'batch': 44, 'size': 0}
45/50 [==========================>...] - ETA: 3s - loss: nan - acc: nan{'acc': nan, 'loss': nan, 'batch': 45, 'size': 0}

I changed the regularization and customized the loss function as follows:

def constrainedCrossEntropy(x, y):
    x = T.clip(x, 0.0001, 0.99999)
    return losses.categorical_crossentropy(x, y)

Model:

l_conv1 = Conv1D(filters, filter_length=filter_sizes[0], strides=1, padding='same', activation='relu',
                 kernel_regularizer=regularizers.l2(0.00001), bias_regularizer=regularizers.l2(0.00001),
                 input_shape=(seq_leng, VOCAB_SIZE))(inputs)
l_pool1 = MaxPooling1D(pool_size=pooling_size, padding='same')(l_conv1)
l_conv2 = Conv1D(filters, filter_length=filter_sizes[0], strides=1, padding='same', activation='relu',
                 kernel_regularizer=regularizers.l2(0.00001), bias_regularizer=regularizers.l2(0.00001))(l_pool1)
l_pool2 = MaxPooling1D(pool_size=pooling_size, padding='same')(l_conv2)

l_conv3 = Conv1D(filters, filter_length=filter_sizes[1], strides=1, padding='same', activation='relu',
                 kernel_regularizer=regularizers.l2(0.00001), bias_regularizer=regularizers.l2(0.00001))(l_pool2)
l_conv4 = Conv1D(filters, filter_length=filter_sizes[1], strides=1, padding='same', activation='relu',
                 kernel_regularizer=regularizers.l2(0.00001), bias_regularizer=regularizers.l2(0.00001))(l_conv3)

Please advise me on what to do. Thanks

@hr0nix As far as I know, it's the combination of relu and softmax that causes numerical troubles, as relu can produce large positive values corresponding to very small probabilities. If you change your model to use, say, tanh instead of relu for the last dense layer, the problem will go away.

Just now I had a problem with the Keras CTC loss function on top of the softmax; I added one more tanh layer before the softmax and the NaNs are gone!

@MaratZakirov That's a really great point, though wouldn't sigmoid with some clipping to make sure it isn't 0 or 1 work better? Tanh can produce negative values which could give you nans again.


I also faced the same issue with the loss variable showing 'nan' while going deeper into the layers.

But I solved the problem by decaying the learning rate every epoch.
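(A minimal sketch of one way to do that with a Keras callback; the schedule itself is just an example:)

from keras.callbacks import LearningRateScheduler

def schedule(epoch):
    # Hypothetical schedule: start at 0.01 and halve the rate every 10 epochs.
    return 0.01 * (0.5 ** (epoch // 10))

lr_decay = LearningRateScheduler(schedule)
# model.fit(X_train, y_train, callbacks=[lr_decay])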

I haven't looked deeply into it, but I think this might have to do with the presence of zeros at some point.

The reason is a workaround I found, which seems pretty robust so far: I just added a layer with very small Gaussian noise after each of my layers. NaNs no more.
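(A minimal sketch of that workaround using the built-in GaussianNoise layer; the 0.01 standard deviation and the layer sizes are just example values:)

from keras.models import Sequential
from keras.layers import Dense, Activation, GaussianNoise

model = Sequential()
model.add(Dense(128, input_dim=64))
model.add(GaussianNoise(0.01))    # tiny noise after the layer, as described above
model.add(Activation('relu'))
model.add(Dense(10))
model.add(Activation('softmax'))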

I also had this problem recently. I have tried loss clipping, weight constraints, and adding a regularizer with a small value. None of them works. I am doing a regression problem and use cuDNN and float64. I use Adam (I tried RMSprop, and still have this problem).

BTW, I do not control the last layer (the linear layer). Will that be a problem?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.

Same problem with the tensorflow cpu and theano gpu backends. However, the same network seems to work perfectly fine with the tensorflow-gpu backend. Another problem exists with the gpu backend though, i.e., it can't save weights on checkpoints and the entire python environment crashes out completely.

I encountered the same problem with the TensorFlow backend, a GTX 980 Ti, on a medium-size image dataset (60, 60, 1). My loss comprises binary_crossentropy and kl_loss. After applying K.clip to both losses, the nan was gone.

I tried every suggestion on this page and many others to no avail. We were importing csv files with pandas, then using keras Tokenizer with text input to create vocabularies and word vector matrices. After noticing some CSV files led to nan while others worked, suddenly we looked at the encoding of the files and realized that ascii files were NOT working with keras, leading to nan loss and accuracy of 0.0000e+00; however, utf-8 and utf-16 files were working! Breakthrough.

If you're performing textual analysis and getting nan loss after trying these suggestions, use file -i {input} (linux) or file -I {input} (osx) to discover your file type. If you have ISO-8859-1 or us-ascii, try converting to utf-8 or utf-16le. Haven't tried the latter but I'd imagine it would work as well. Hopefully this helps someone very very frustrated!
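(A minimal sketch of re-encoding such a CSV to UTF-8 with pandas; the file names are placeholders:)

import pandas as pd

# Read with the original encoding, then write back out as UTF-8.
df = pd.read_csv('data_latin1.csv', encoding='ISO-8859-1')
df.to_csv('data_utf8.csv', encoding='utf-8', index=False)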

training on CPU, have tried everything on this page. Inputs are fine; used np.nan_to_num on literally every input...

There are multiple causes of this problem. What happened to me is that my output dimension didn't match the label indices for sparse_categorical_crossentropy: running on CPU it throws a dimension-mismatch error, but running on GPU it doesn't throw an error at all.
That caused the GPU nan loss error in my case.

Hello guys,
I got nan loss when training 1D and 2D LSTMs.
I already tried reducing the learning rate, adding regularization, and making sure there is no NaN in the training and validation sets, but the loss is always NaN.

loss function kld

def kl_divergence(labels, prediction, epsilon=1e-7):
    prediction /= (tf.reduce_sum(prediction, axis=1, keep_dims=True) + epsilon)
    labels /= (tf.reduce_sum(labels, axis=1, keep_dims=True) + epsilon)
    result = tf.reduce_mean(tf.reduce_sum(labels * tf.log((labels / (prediction + epsilon)) + epsilon), axis=1))
    return result

2D lstm model

def lstm_model_2D(time_step, input_size, output_size):

    model = Sequential()
    model.add(ConvLSTM2D(8, (3, 3), strides=(1, 1), padding='same', data_format=None, dilation_rate=(1, 1),
                         activation='relu', recurrent_activation='hard_sigmoid', use_bias=True,
                         kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal',
                         bias_initializer='zeros',
                         unit_forget_bias=True, kernel_regularizer=None, recurrent_regularizer=None,
                         bias_regularizer=None,
                         activity_regularizer=None, kernel_constraint=None, recurrent_constraint=None,
                         bias_constraint=None,
                         return_sequences=True, go_backwards=False, stateful=False, dropout=0.0, recurrent_dropout=0.0,
                         input_shape=(time_step, ) + input_size))

    model.add(ConvLSTM2D(4, (3, 3), strides=(1, 1), padding='same', data_format=None, dilation_rate=(1, 1),
                         activation='relu', recurrent_activation='hard_sigmoid', use_bias=True,
                         kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal',
                         bias_initializer='zeros',
                         unit_forget_bias=True, kernel_regularizer=None, recurrent_regularizer=None,
                         bias_regularizer=None,
                         activity_regularizer=None, kernel_constraint=None, recurrent_constraint=None,
                         bias_constraint=None,
                         return_sequences=True, go_backwards=False, stateful=False, dropout=0.0, recurrent_dropout=0.0))

    model.add(ConvLSTM2D(1, (3, 3), strides=(1, 1), padding='same', data_format=None, dilation_rate=(1, 1),
                         activation='relu', recurrent_activation='hard_sigmoid', use_bias=True,
                         kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal',
                         bias_initializer='zeros',
                         unit_forget_bias=True, kernel_regularizer=None, recurrent_regularizer=None,
                         bias_regularizer=None,
                         activity_regularizer=None, kernel_constraint=None, recurrent_constraint=None,
                         bias_constraint=None,
                         return_sequences=True, go_backwards=False, stateful=False, dropout=0.0, recurrent_dropout=0.0))
    model.add(TimeDistributed(Flatten()))
    # add output layer
    model.add(TimeDistributed(Dense(output_size)))
    return model

1D lstm model

def lstm_model_1D(time_step, input_size, output_size):

    model = Sequential()
    model.add(TimeDistributed(Conv2D(8, (3, 3), activation='relu', padding='same', name='conv1'),
                              input_shape=(time_step, input_size[0], input_size[1], input_size[2])))
    model.add(TimeDistributed(Conv2D(1, (3, 3), activation='relu', padding='same', name='conv1')))
    model.add(TimeDistributed(Flatten()))
    # build a LSTM RNN
    model.add(LSTM(
        batch_input_shape=(None, time_step, input_size[0] * input_size[1]),  # Or: input_dim=INPUT_SIZE, input_length=TIME_STEPS,
        output_dim=256,
        return_sequences=True,  # True: output at all steps. False: output as last step.
        activation='relu',
        bias_initializer='RandomUniform',
        dropout=0.5
    ))
    model.add(LSTM(
        batch_input_shape=(None, time_step, input_size[0] * input_size[1]),  # Or: input_dim=INPUT_SIZE, input_length=TIME_STEPS,
        output_dim=256,
        return_sequences=True,  # True: output at all steps. False: output as last step.
        activation='relu',
        bias_initializer='RandomUniform',
        dropout=0.5
    ))
    # add output layer
    model.add(TimeDistributed(Dense(output_size)))
    return model

Try clip_norm to clip the gradient


If you have custom layers, check for possible over/underflow with exp/log. For example, I modified Softmax to include a temperature term, but forgot to subtract max off first for numerical stability.

Dear all,

I encountered the same problem on my laptop, a MacBook Pro (Retina, 15-inch, Late 2013) with an Intel Iris Pro 1536 MB graphics card! The problem was solved by changing the GPU configuration:

import tensorflow as tf
from keras.backend.tensorflow_backend import set_session
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
set_session(tf.Session(config=config))
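(For comparison, a minimal sketch of hiding the GPU from TensorFlow entirely; the environment variable has to be set before TensorFlow is imported:)

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'   # TensorFlow will then see no GPUs

import tensorflow as tf  # imported only after the variable is set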

My keras simple NN code:

def basic_model_1(x_size, y_size):
    t_model = Sequential()
    t_model.add(Dense(100, activation="tanh", input_shape=(x_size,)))
    t_model.add(Dense(50, activation="relu"))
    t_model.add(Dense(y_size))
    print(t_model.summary())
    t_model.compile(loss='mean_squared_error',
                    optimizer=Adam(),
                    metrics=[metrics.mae])
    return t_model

model = basic_model_1(arr_x_train.shape[1], arr_y_train.shape[1])

%%time
history = model.fit(arr_x_train, arr_y_train,
                    batch_size=128,
                    epochs=500,
                    shuffle=True,
                    verbose=1,
                    validation_data=(arr_x_valid, arr_y_valid),
                    callbacks=[EarlyStopping(monitor='val_loss', patience=20)])

For future reference:
I had this problem with a WGAN-GP network with some GRU layers; the problem went away after adding clip_norm=1.0 to the optimizer and removing any data containing non-finite numbers from the dataset.
Lesson learned:
Always use clip_norm when you have any recurrent layers.

edit:
Don't set clip_norm to 1!! Set it to some reasonable value like 5-6 if you have this problem, maybe even higher! Clip norm is a threshold where a lower value means a stronger clipping effect, so setting it as low as 1 is drastic and will likely cause very strong vanishing gradients.

I ended up with a clip_norm value of 3.0 that seemed to fix the nan loss without making the gradients vanish completely.

Did you try to compile your model using the 'adam' optimizer?
Worked for me, as explained on this post https://stackoverflow.com/questions/37232782/nan-loss-when-training-regression-network

I thought I had this problem, but it turned out to be an issue with nb_classes. There were a few samples out of vocabulary (integers larger than the maximum nb_classes). sparse_categorical_crossentropy doesn't produce a proper error message about this; I saw nan halfway through the first epoch. Changing the number of classes in the final layer fixed the error for me.
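(A minimal sketch of a sanity check for this kind of label/nb_classes mismatch; the names are placeholders:)

import numpy as np

def check_sparse_labels(y, nb_classes):
    # Sparse labels must be integers in [0, nb_classes) to match the output layer.
    y = np.asarray(y)
    assert y.min() >= 0 and y.max() < nb_classes, (
        'labels span [%d, %d] but the output layer has %d units'
        % (y.min(), y.max(), nb_classes))

# check_sparse_labels(y_train, nb_classes=10)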

You may try to switch to float64 precision instead of float32 (the floatx parameter in keras.json).
However, on my CPU, the option was to switch to the Adam optimizer (gradient descent might be the key).

Regards


I solved this issue by adding a batch normalization layer between linear and nonlinear layers.
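(A minimal sketch of that layer ordering, Dense -> BatchNormalization -> Activation; the sizes are illustrative:)

from keras.models import Sequential
from keras.layers import Dense, Activation, BatchNormalization

model = Sequential()
model.add(Dense(256, input_dim=100))   # linear part, no activation yet
model.add(BatchNormalization())        # normalize before the nonlinearity
model.add(Activation('relu'))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')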

I had the same problem, I point it here in this link:
https://github.com/keras-team/keras/issues/2134

For me, it worked after I changed everything to the same type (float32) and rebooted my machine.

like @9thDimension said, upgrading Theano to the bleeding-edge version (0.9.0-dev2) seems to have fixed the nan issues for me so far on debian wheezy. i'm using a python 3.5.2 env in anaconda 4.1.1.

i just followed the instructions from the theano website here:
pip install --upgrade --no-deps git+git://github.com/Theano/Theano.git

...this worked for me, in addition to batch normalization. I'm using Adam as the optimizer, and int8 as the type. But this comment was from 2016, and the Theano versions are different today.

Found existing installation: Theano 1.0.3+2.g3e47d39.dirty Uninstalling Theano-1.0.3+2.g3e47d39.dirty: Successfully uninstalled Theano-1.0.3+2.g3e47d39.dirty Successfully installed Theano-1.0.3+29.g31b853fbd

I have no idea why this worked, but it did.

Edit: didn't work, nan'd out after 8 hours.

Got this problem on TF backend, while using multi_gpu_model to train on 2x GPUs. Switching to single GPU "solved" the problem.

Most of the answers here go in the wrong direction. A plain 'nan' loss is a different, common issue, and most of those cases are due to gradient explosion. This issue should be about "CPU ok but GPU gets nan", which obviously cannot be blamed on the gradients. Those who get 'NaN' on CPU should not ask for a solution here.
As for me, I solved it by downgrading my TensorFlow version from 1.14.0 to 1.13.1; it seems this problem is caused by some bugs in the Adam optimizer in certain versions of TensorFlow, which may be related to this issue. Hope it helps.

@hr0nix Thanks, it works. But instead of changing relu to tanh, I first added a BN before the relu to squash the values into a lower range, and it still gives me NaN loss. Why is that? I think the idea of both solutions is the same.


@hr0nix No, tanh still gives NaN loss, but after more steps.

NaN is normally caused by numerical overflow, meaning you either have zero gradients or divisions by zero. Try using batch normalization on all layers where you need to calculate gradients, and also clipping to cap the gradient values.


Actually, I did apply BN for all layers. I think the problem for me is the softmax:

# suppose I have a layer x with shape [-1, -1, 16]

# Normal
x = tf.nn.softmax(x)

# NaN loss on v100 GPU, normal on CPU
x = tf.nn.softmax(x, axis=1)


Yeah, that happened to me as well. Why do you use -1 as the shape? What happened to me before is that I had a wrong shape: the CPU would throw an error but the GPU would compute NaN.


-1 to represent dynamic shape, let's say [100, 200].


This is still happening. I am using CUDA 10, TensorFlow 1.14 and Keras 2.2.5. This should be a high priority to fix; it's a killer bug, and it's strange that it has been open for 4 years with no one trying to fix it.

It's because it has multiple causes; we can't narrow it down to a single one.


I also tried mse as the loss function, which ran into 'nan' as well.

In that case the overflow is happening earlier in the graph.

Next, you could try removing the regularizers.

BTW the history callback is included by default, no need to specify it manually.

I removed the activity_regularizer and it fixed it. My Keras version is 2.1.5.

I was having the same issue. Nan loss on GPU but not on CPU.
I double-checked my data and everything looked fine.
But I also noticed a certain error message: "Failed to get device properties, error code: 30"

Both issues appear to have been fixed after I updated to the latest available Nvidia driver.

I hope this helps someone. I could not find the cause of the problem but it's fair to say it somehow correlates with the driver I had installed.

A year passed, the bug is still there. Steps to reproduce:

  1. Xception classifier from Keras/Applications
  2. Adding an l2 weight regularizer to the convolutional layers (as described in the original paper, but missing in the implementation)
    Training on 1 GPU: ok
    Training on >1 GPU: loss nan after 2-3 hours
    Training without L2 reg on >1 GPU: ok
    Confirmed for both Adam and RMSprop.

I am having a similar issue, but this is with a multi-output model. I am using _TF2.0 with Keras_ model layers. The model is trained on a _single GPU_ machine using _CUDA 10.0_. During training after a few epochs, individual losses are finite numbers but the total loss turns to nan.

The total loss is given as loss. There are 5 individual losses: xout_loss and yout_<n>_loss, n = [0, 1, 2, 3].

My loss weights are configured as [1., 0.1, 0.1, 0.1, 0.1]

The model uses convolutional layers with relu activation and l2 kernel_regularizers (1e-4)

--------------------------------------------------
STARTING EPOCH: 18
             loss:   nan,
    xout_loss:2.968e-02,
yout_0_loss:9.024e-01,  yout_1_loss:9.026e-01,  yout_2_loss:9.023e-01,  yout_3_loss:9.024e-01,
--------------------------------------------------

What I fail to understand is how the total loss can become nan when the individual components are not. Is this a bug or a problem with my implementation?
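(For context, a toy sketch of how the reported total loss is assembled for a multi-output model: it is the loss_weights-weighted sum of the per-output losses plus any regularization penalties, which is one place a nan can appear without showing up in the individual components. The names and weights below are made up:)

from keras.layers import Input, Dense
from keras.models import Model

inp = Input(shape=(32,))
h = Dense(64, activation='relu')(inp)
xout = Dense(1, name='xout')(h)
yout = Dense(1, name='yout')(h)

model = Model(inputs=inp, outputs=[xout, yout])
# total loss = 1.0 * xout_loss + 0.1 * yout_loss (+ any regularization terms)
model.compile(optimizer='adam',
              loss={'xout': 'mse', 'yout': 'mse'},
              loss_weights={'xout': 1.0, 'yout': 0.1})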

I experienced this issue when running my temporal convolution network (TCN) model with sparse_categorical_crossentropy loss function on the GPU. Keras with TF backend. I changed the function to kullback_leibler_divergence and the problem of NaN values was gone.

Loss turning into 'nan' may be caused by a simple issue: the wrong output size of the model. I had the issue when I wrongly counted the unique labels in the training set. For example, the unique training labels weren't continuous, like unique_labels = [0, 2, 5, 6, ...], and I used a model output shape of [None, len(unique_labels)]... such a stupid miscalculation... Then I changed the output shape to [None, max(unique_labels) + 1], and I got the losses back. Another way to do it is to convert the non-continuous labels into continuous ones.
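(A minimal sketch of that second option, remapping non-continuous labels to a continuous range with numpy:)

import numpy as np

labels = np.array([0, 2, 5, 6, 2, 0])                  # example non-continuous labels
unique_labels, remapped = np.unique(labels, return_inverse=True)
# unique_labels -> [0 2 5 6], remapped -> [0 1 2 3 1 0]
nb_classes = len(unique_labels)                        # output layer size now matches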

Same problem here. Training on a google cloud VM with 4 T4 Teslas:

The problem is described more in my SO question:

https://stackoverflow.com/questions/61274792/model-suddenly-forgets-all-it-has-learned-and-stops-working-at-around-110-epoc?noredirect=1#comment108399684_61274792

EDIT: My bad, I missed the activation function off the 6th line. This solved my problem.

Same issue with Adam, Relu, CPU and tensorflow backend. It worked on GPU before.

I got the same problem; however, my problem was simpler:
My dataset was labeled with [0, 2] instead of [0, 1] (only two labels), my model outputs logits according to the number of labels, and my loss function (SparseCategoricalCrossentropy) expected 3 labels (0, 1 and 2). I fixed my error by changing label 2 -> 1.

I have the same problem. I train an LSTM network with my own data. The train_loss becomes NaN suddenly. I checked my code with the imdb dataset and it works OK, but when I switch to my dataset the nan problem occurs. I preprocessed my data in the same way that the imdb dataset is preprocessed in the imdb_lstm example of Keras. I do not understand what the problem is. It seems that the network configuration is OK since it runs with another dataset. However, my dataset and the imdb dataset are both text; how can another text dataset cause this issue? I tried gradient clipping and also weight norm limitations. I think the sudden change happens when an inf value is produced in the categorical crossentropy function, such as log(0). But how can I detect and avoid this problem?

Same issue with me. Have you found any solution?

Most of the answers here go in the wrong direction. A plain 'nan' loss is a different, common issue, and most of those cases are due to gradient explosion. This issue should be about "CPU ok but GPU gets nan", which obviously cannot be blamed on the gradients. Those who get 'NaN' on CPU should not ask for a solution here.
As for me, I solved it by downgrading my TensorFlow version from 1.14.0 to 1.13.1; it seems this problem is caused by some bugs in the Adam optimizer in certain versions of TensorFlow, which may be related to this issue. Hope it helps.

This was a great clue as I noticed that my code and parameters worked fine on CPU but on GPU would get inf validation loss. Seems like there's a bug with Adamax as suggested above. I switched from Adamax to Adam and it works on GPUs now. I'm on tf 1.15
