Tfjs: Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

Created on 29 Mar 2019 · 8 comments · Source: tensorflow/tfjs

To get help from the community, we encourage using Stack Overflow and the tensorflow.js tag.

TensorFlow.js version

{ 'tfjs-core': '1.0.3',
'tfjs-data': '1.0.3',
'tfjs-layers': '1.0.3',
'tfjs-converter': '1.0.3',
tfjs: '1.0.3',
'tfjs-node': '1.0.2' }

Browser version

Running on node
Ubuntu 18.04

$ nvidia-smi
Fri Mar 29 19:25:37 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104 Driver Version: 410.104 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 2070 Off | 00000000:01:00.0 On | N/A |
| N/A 46C P8 9W / N/A | 879MiB / 7952MiB | 3% Default |
+-------------------------------+----------------------+----------------------+

Describe the problem or feature request

I'm unable to use cuDNN convolutional layers in my model with tfjs-node-gpu.
This is possibly related to known issues with the RTX series; in this TensorFlow workaround there is a suggestion to set
config.gpu_options.allow_growth = True

Is there such an option in TensorFlow.js?

Code to reproduce the bug / link to feature request

const tf = require('@tensorflow/tfjs-node-gpu');
const model = tf.sequential({
    layers: [
      tf.layers.conv2d({
        inputShape: [32, 32, 3],
        filters: 32,
        kernelSize: [3, 3],
        activation: 'relu',
      }),
      // maxPooling2d takes a config object, not a bare array
      tf.layers.maxPooling2d({poolSize: [2, 2]}),
    ],
  });
// predict() returns a Tensor synchronously, not a Promise
model.predict(tf.randomNormal([4, 32, 32, 3])).print();

$ node index.js
2019-03-29 19:22:37.112495: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-03-29 19:22:37.249964: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-03-29 19:22:37.250443: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x3aa4000 executing computations on platform CUDA. Devices:
2019-03-29 19:22:37.250458: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): GeForce RTX 2070, Compute Capability 7.5
2019-03-29 19:22:37.271245: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2208000000 Hz
2019-03-29 19:22:37.271958: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x3aa2750 executing computations on platform Host. Devices:
2019-03-29 19:22:37.271972: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): ,
2019-03-29 19:22:37.272241: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.44
pciBusID: 0000:01:00.0
totalMemory: 7.77GiB freeMemory: 6.80GiB
2019-03-29 19:22:37.272275: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-03-29 19:22:37.273295: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-29 19:22:37.273308: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-03-29 19:22:37.273314: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-03-29 19:22:37.273435: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6612 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)
2019-03-29 19:22:38.761993: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-03-29 19:22:38.763178: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
/home/bobi/Desktop/cudnn/node_modules/@tensorflow/tfjs-core/dist/engine.js:132
throw ex;
^

Error: Invalid TF_Status: 2
Message: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
at NodeJSKernelBackend.executeSingleOutput (/home/bobi/Desktop/cudnn/node_modules/@tensorflow/tfjs-node-gpu/dist/nodejs_kernel_backend.js:192:43)
at NodeJSKernelBackend.conv2d (/home/bobi/Desktop/cudnn/node_modules/@tensorflow/tfjs-node-gpu/dist/nodejs_kernel_backend.js:700:21)
at environment_1.ENV.engine.runKernel.x (/home/bobi/Desktop/cudnn/node_modules/@tensorflow/tfjs-core/dist/ops/conv.js:152:27)
at /home/bobi/Desktop/cudnn/node_modules/@tensorflow/tfjs-core/dist/engine.js:171:26
at Engine.scopedRun (/home/bobi/Desktop/cudnn/node_modules/@tensorflow/tfjs-core/dist/engine.js:126:23)
at Engine.runKernel (/home/bobi/Desktop/cudnn/node_modules/@tensorflow/tfjs-core/dist/engine.js:169:14)
at conv2d_ (/home/bobi/Desktop/cudnn/node_modules/@tensorflow/tfjs-core/dist/ops/conv.js:151:40)
at Object.conv2d (/home/bobi/Desktop/cudnn/node_modules/@tensorflow/tfjs-core/dist/ops/operation.js:46:29)
at /home/bobi/Desktop/cudnn/node_modules/@tensorflow/tfjs-layers/dist/layers/convolutional.js:198:17
at /home/bobi/Desktop/cudnn/node_modules/@tensorflow/tfjs-core/dist/engine.js:116:22

Most helpful comment

As explained in https://github.com/tensorflow/tfjs/issues/671#issuecomment-494832790

There is a workaround: set the environment variable
export TF_FORCE_GPU_ALLOW_GROWTH=true

All 8 comments

The same error happens even when there are no convolutional layers in the model.
Models:

const actor = () => tf.sequential({
    layers: [
      tf.layers.inputLayer({inputShape: STATE_SIZE}),
      tf.layers.batchNormalization(),
      tf.layers.dense({units: ACTION_SIZE * 2, activation: 'relu'}),
      tf.layers.dense({units: ACTION_SIZE, activation: 'softmax'}),
    ],
  });

  const critic = () => {
    const stateInput = tf.input({shape: [STATE_SIZE]});
    const actionInput = tf.input({shape: [ACTION_SIZE]});
    const bn = tf.layers.batchNormalization().apply(stateInput);
    const d1 = tf.layers.dense({units: ACTION_SIZE * 2, activation: 'relu'})
      .apply(bn);
    const d2 = tf.layers.dense({units: ACTION_SIZE,
      activation: 'softmax'}).apply(d1);
    const concat = tf.layers.concatenate().apply([d2, actionInput]);
    const d3 = tf.layers.dense({units: ACTION_SIZE,
      activation: 'relu'}).apply(concat);
    const output = tf.layers.dense({units: 1}).apply(d3);
    return tf.model({inputs: [stateInput, actionInput], outputs: output});
  };

$ node server/start.js
2019-04-03 20:26:24.022854: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-04-03 20:26:24.151743: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-04-03 20:26:24.152219: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x3ac43c0 executing computations on platform CUDA. Devices:
2019-04-03 20:26:24.152233: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): GeForce RTX 2070, Compute Capability 7.5
2019-04-03 20:26:24.171244: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2208000000 Hz
2019-04-03 20:26:24.171685: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x3ac2b10 executing computations on platform Host. Devices:
2019-04-03 20:26:24.171699: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): ,
2019-04-03 20:26:24.171843: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.44
pciBusID: 0000:01:00.0
totalMemory: 7.77GiB freeMemory: 6.57GiB
2019-04-03 20:26:24.171855: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-04-03 20:26:24.172565: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-03 20:26:24.172575: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-04-03 20:26:24.172579: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-04-03 20:26:24.172688: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6389 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)
Starting with random weights.
(node:20980) ExperimentalWarning: The fs.promises API is experimental
Listening on 3000
connection
2019-04-03 20:26:27.335442: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-04-03 20:26:27.335505: W ./tensorflow/stream_executor/stream.h:2099] attempting to perform DNN operation using StreamExecutor without DNN support
2019-04-03 20:26:27.346775: I tensorflow/stream_executor/stream.cc:2079] [stream=0x4a7f370,impl=0x4a7f410] did not wait for [stream=0x4a7ed90,impl=0x4a76260]
2019-04-03 20:26:27.346799: I tensorflow/stream_executor/stream.cc:5027] [stream=0x4a7f370,impl=0x4a7f410] did not memcpy host-to-device; source: 0x4a02a980
2019-04-03 20:26:27.346837: F tensorflow/core/common_runtime/gpu/gpu_util.cc:339] CPU->GPU Memcpy failed

@bobiblazeski, have you found any resolution to this issue?

I'm currently blocked by this same error.
Ubuntu 18.04
GTX 1660; Driver 418.56; CUDA 10.1 (even though I followed the instructions for 10.0...)

@adwellj Nope, I'm training on the CPU until this is resolved.

@bobiblazeski, I punted over to trying on Windows and finally got this working. I had to drop down to tfjs-node-gpu version 0.3.2 due to node-gyp issues.

However, once I finally got it to install, I ran into this same cuDNN issue! Fortunately, using CUDA 9.0 (needed for 0.3.2 compatibility) I got a better error message before the "This is probably because cuDNN failed to initialize..." message, stating that tfjs-node-gpu was built against cuDNN version 7.2. Once I downloaded that version, everything worked.

I haven't gone back to see if I could get it to work on the Linux install, but I'm hoping this is just a cuDNN version incompatibility that you could experiment with. Luckily cuDNN doesn't have an install/uninstall process; it's simply a matter of copying the extracted files into a dedicated directory that you include in your system path.

I hope that helps give you some possible direction!

As explained in https://github.com/tensorflow/tfjs/issues/671#issuecomment-494832790

There is a workaround: set the environment variable
export TF_FORCE_GPU_ALLOW_GROWTH=true
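A minimal sketch of applying that workaround from inside the script itself rather than the shell (assumption: the native TensorFlow library reads TF_FORCE_GPU_ALLOW_GROWTH once at initialization, so the variable must be set before tfjs-node-gpu is required):

```javascript
// Set the flag before tfjs-node-gpu initializes the TensorFlow runtime.
// (Assumption: the native library reads TF_FORCE_GPU_ALLOW_GROWTH once at
// load time, so this assignment must come before the require call.)
process.env.TF_FORCE_GPU_ALLOW_GROWTH = 'true';

// const tf = require('@tensorflow/tfjs-node-gpu'); // load only after the flag is set

console.log(process.env.TF_FORCE_GPU_ALLOW_GROWTH); // prints 'true'
```

This avoids having to remember the export in every shell session; the exported-variable form shown above works the same way.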

@adwellj Nope, I'm training on the CPU until this is resolved.

Have you managed to get the GPU working?

I am having this issue too, but it seems to resolve itself only when I restart my computer. This seems rather odd to me. I notice that the issue tends to happen after terminating my application(s) that utilize tfjs.

EDIT: I tried adding TF_FORCE_GPU_ALLOW_GROWTH=true as an environment variable, and it seemed to work briefly, but when I tried to run my program again, the error started appearing again.

This seems to be a duplicate of #671; we will close this issue and track it in one place. Thank you.
