Keras: Crash when using multi_gpu_model and n_sample is not a multiple of batch_size

Created on 19 Oct 2018 · 16 Comments · Source: keras-team/keras

I had this error when trying to fit a multi_gpu_model that fits just fine on a single GPU:

F tensorflow/stream_executor/cuda/cuda_dnn.cc:522] Check failed: cudnnSetTensorNdDescriptor(handle_.get(), elem_type, nd, dims.data(), strides.data()) == CUDNN_STATUS_SUCCESS (3 vs. 0)batch_descriptor: {count: 0 feature_map_count: 16 spatial: 128 128 128  value_min: 0.000000 value_max: 0.000000 layout: BatchDepthYX}

After some investigation (because I was trying to file a decent bug report), it turns out this happens when I try to fit using a batch_size that is not a divisor of the number of samples in my dataset.

I originally asked for help on SO here, where you can read more details. Maybe this cannot be changed, but my suggestion would be to have a more intelligible error message for a regular human like me. ;o)
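For what it's worth, the "count: 0" in the batch descriptor above presumably means that one GPU received an empty slice of the final, partial batch. Until the error message improves, a common workaround is to trim the dataset so the sample count is an exact multiple of the batch size. A minimal sketch (X, y, batch_size, and model are placeholders, not from the original report):

import numpy as np

batch_size = 32  # placeholder value
X = np.random.rand(1007, 64, 64, 3)  # 1007 is deliberately not a multiple of 32
y = np.random.randint(0, 2, size=len(X))

# Keep only the largest multiple of batch_size samples.
usable = (len(X) // batch_size) * batch_size
X, y = X[:usable], y[:usable]

# model.fit(X, y, batch_size=batch_size)  # now every batch is full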

To investigate

Most helpful comment

I think it's worth noting, since this issue pops up first when googling this error: it also happens in the general case, when one passes an input whose spatial dimensions are too small, presumably for the pooling layers.

All 16 comments

What's more, it will also happen if the validation sample size is not a multiple of the batch size.
I hadn't seen this before (could be luck); it only started after upgrading to the latest cuDNN version.

Indeed @kfeeeeee, I forgot to mention it, but I can confirm this also happens for me.

I think it's worth noting, since this issue pops up first when googling this error: it also happens in the general case, when one passes an input whose spatial dimensions are too small, presumably for the pooling layers.

I can confirm this issue on my machine as well. After looking at this post, I ran with the single-GPU configuration and it started working. With multi-GPU (8x Nvidia K80), it doesn't work.

Update:
My multi-GPU issue is fixed. I was using a batch size of 64, and the last batch had only 38 images. However, my generator was yielding 64 images (in each batch, the images were cast into a new array with 64 as the first dimension) but only 38 labels. Once I fixed the generator to yield 38 images to match the 38 labels, multi-GPU training worked.

The weird thing is that single-GPU training still worked, and I only spotted the mismatch when the first epoch was about to end and the last batch was processed. With multi-GPU enabled, the failure appeared before the first epoch even began, which implies there is some additional logic that performs these 'forward' checks up front.
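As an aside, here is a minimal sketch of a generator that cannot get into that state, because images and labels are sliced with the same indices (the array names and sizes are placeholders):

import numpy as np

def batch_generator(X, y, batch_size):
    # Yield (images, labels) batches whose first dimensions always match,
    # including the final partial batch (e.g. 38 and 38, never 64 and 38).
    n = len(X)
    while True:
        for start in range(0, n, batch_size):
            end = min(start + batch_size, n)
            yield X[start:end], y[start:end]

# Example: 358 samples with batch_size 64 -> the last batch has 38 pairs.
X = np.zeros((358, 128, 128, 3))
y = np.zeros(358)
gen = batch_generator(X, y, 64)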

My configuration:
Keras: 2.2.4
Python: 3.6.8
Machine: Google VM
CPU Platform: Intel Haswell
GPU: 8x Nvidia K80
RAM: 208 GB

I am also having this problem, if anyone can help. When my code reaches train_on_batch(X, Y), I get this error:

2019-03-17 19:16:58.883468: F tensorflow/stream_executor/cuda/cuda_dnn.cc:542] Check failed: cudnnSetTensorNdDescriptor(handle_.get(), elem_type, nd, dims.data(), strides.data()) == CUDNN_STATUS_SUCCESS (3 vs. 0)batch_descriptor: {count: 2 feature_map_count: 76 spatial: 0 120 value_min: 0.000000 value_max: 0.000000 layout: BatchDepthYX}

Then my code crashes.

Check failed: cudnnSetTensorNdDescriptor(handle_.get(), elem_type, nd, dims.data(), strides.data()) == CUDNN_STATUS_SUCCESS (3 vs. 0)batch_descriptor: {count: 32 feature_map_count: 288 spatial: 0 0 value_min: 0.000000 value_max: 0.000000 layout: BatchDepthYX}
Aborted (core dumped)

Check failed: cudnnSetTensorNdDescriptor(handle_.get(), elem_type, nd, dims.data(), strides.data()) == CUDNN_STATUS_SUCCESS (3 vs. 0)batch_descriptor: {count: 128 feature_map_count: 768 spatial: 0 1 value_min: 0.000000 value_max: 0.000000 layout: BatchDepthYX}

Same issue!

I did not debug too much, but I got this error when my input images had too low a resolution (in particular 41 x 41 pixels). I would guess all the down-pooling leads to a 0x0 dimension, which crashes Python. A proper error message would be nicer, though.
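To make that guess concrete, here is a rough back-of-the-envelope calculation; InceptionV3's exact padding rules differ slightly, but the trend is the point (its documented minimum input size is 75 x 75):

# Approximate each stride-2 stage as halving the spatial size.
size = 41
for stage in range(1, 6):
    size = size // 2
    print(f"after downsampling stage {stage}: {size}x{size}")
# 41 -> 20 -> 10 -> 5 -> 2 -> 1; with 'valid' padding some stages lose a
# few extra pixels and can reach 0, producing the 'spatial: 0 ...' values
# seen in the cuDNN errors above.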

Minimal failing colab: https://colab.research.google.com/drive/11wClV9iD1IVcu09zCn6jA-d1_FhsVvtl

from keras.applications.inception_v3 import InceptionV3
from keras.models import Model
from keras.layers import Dense, GlobalAveragePooling2D
import numpy
from sklearn.model_selection import train_test_split

# Load InceptionV3 without its classification head.
base_model = InceptionV3(weights='imagenet', include_top=False)

x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dense(1024, activation='relu')(x)
predictions = Dense(2, activation='softmax')(x)

model = Model(inputs=base_model.input, outputs=predictions)

# Freeze the pretrained base.
for layer in base_model.layers:
    layer.trainable = False

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# 41x41 inputs are below InceptionV3's documented 75x75 minimum.
X = numpy.random.rand(500, 41, 41, 3)
y = numpy.random.rand(500) > .5

X_train, X_test, y_train, y_test = train_test_split(X, y)

model.fit(X_train[:2], y_train[:2])

# Log error:
# Jul 8, 2019, 12:01:43 PM  WARNING 2019-07-08 10:01:43.055132: F tensorflow/stream_executor/cuda/cuda_dnn.cc:516] Check failed: cudnnSetTensorNdDescriptor(handle_.get(), elem_type, nd, dims.data(), strides.data()) == CUDNN_STATUS_SUCCESS (3 vs. 0)batch_descriptor: {count: 2 feature_map_count: 288 spatial: %d 0%d 0 value_min: 0.000000 value_max: 0.000000 layout: BatchDepthYX}

I believe the solution to my multi-GPU issue was to feed a number of samples that is a multiple of the number of GPUs in use, so the data can be split evenly, e.g. into pairs for a 2-GPU setup.
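A short sketch of that idea (gpus, batch_size, and the array names are placeholders; multi_gpu_model splits each global batch across the GPUs):

import numpy as np

gpus = 2            # placeholder values
batch_size = 32     # per-GPU batch size

X_train = np.zeros((1001, 64, 64, 3))  # 1001 is deliberately awkward
y_train = np.zeros(1001)

# Keep the sample count divisible by gpus * batch_size so every replica
# receives a non-empty slice of every batch.
global_batch = gpus * batch_size
usable = (len(X_train) // global_batch) * global_batch
X_train, y_train = X_train[:usable], y_train[:usable]

# parallel_model.fit(X_train, y_train, batch_size=global_batch)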

Is it even possible to try-except this?

Can someone confirm whether n_sample is the size of the training/validation data, i.e. the length of those data arrays? I tried making the lengths of those arrays multiples of 64 (my batch size), but I'm still getting an error.

(Quoting the earlier comment about 41 x 41 inputs, with the same minimal failing colab and code.)

I think I have the same cause as yours.
I am using CIFAR10, where the images are 32x32.

Just wondering, what was your fix?

Thanks

@franva, this thread is reporting errors with multi-GPU training. Judging from your error and the one you are replying to, this is not a multi-GPU issue, so you are probably posting in the wrong spot.

If your error is the same as the one you are quoting, it has to do with the shape of your network layers and/or the number of downsampling convolutions and/or pooling layers.

If you use pooling or downsampling convolutions that each drop the resolution by 1/2, then your original image should at minimum have dimensions of 2^n, where n is the number of downsampling convolutions/pooling layers.

E.g., if you have 5 pooling or downsampling convolutions, then your original image resolution should be at least 2^5 = 32 (in each dimension).
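A quick sanity check of that rule of thumb (it ignores padding details, which usually push the real minimum higher):

def min_input_edge(n_downsamples):
    # Smallest input edge that survives n halving stages without
    # collapsing to zero (rule of thumb only).
    return 2 ** n_downsamples

for n in range(1, 6):
    print(f"{n} downsampling stages -> at least {min_input_edge(n)} px per side")
# 5 stages -> 32 px, so 32x32 CIFAR10 sits right at the edge; InceptionV3's
# 'valid'-padded layers push its documented minimum up to 75x75.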

@gattia thanks. So I think I need to resize the 32x32 sample images up to InceptionV3's required minimum resolution. Could you please help with some code?
My code is here:
https://stackoverflow.com/questions/59222403/keras-use-trained-inceptionv3-model-cifar10-got-error-about-batch-size

https://stackoverflow.com/questions/41733210/what-size-should-my-image-be-to-retrain-inception

Try that link; there are many other resources available online for how to resize an image. That was the first or second hit in a Google search.
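For completeness, one possible way to upscale CIFAR10 to meet InceptionV3's documented 75x75 minimum; this sketch assumes TensorFlow 2's tf.image.resize (other resizing utilities work just as well):

import tensorflow as tf
from keras.datasets import cifar10

(x_train, y_train), (x_test, y_test) = cifar10.load_data()

# Upscale from 32x32 to 96x96, comfortably above the 75x75 minimum.
x_train_big = tf.image.resize(x_train.astype("float32") / 255.0, (96, 96)).numpy()
x_test_big = tf.image.resize(x_test.astype("float32") / 255.0, (96, 96)).numpy()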
