Keras: Segfault in keras with tensorflow backend

Created on 20 Jun 2018 · 17 Comments · Source: keras-team/keras

Hello,
I keep hitting the same segfault. I've just installed gdb and captured the backtrace below. My system is a fresh Ubuntu 16.04 install with keras 2.2, tf 1.9.0rc1, and numpy 1.14.5 (compiled from source). Please help.
Thanks in advance.

`Thread 20 "python3.5" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff779bf700 (LWP 2911)]
0x0000000000000045 in ?? ()
(gdb) bt

0 0x0000000000000045 in ?? ()

1 0x00007fffaf185466 in tensorflow::Tensor::~Tensor() ()

from /usr/local/lib/python3.5/dist-packages/tensorflow/python/../libtensorflow_framework.so

2 0x00007fffaf3180db in tensorflow::(anonymous namespace)::ExecutorState::Process(tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, long long) ()

from /usr/local/lib/python3.5/dist-packages/tensorflow/python/../libtensorflow_framework.so

3 0x00007fffaf319a2a in std::_Function_handler const&, tensorflow::(anonymous namespace)::ExecutorState::TaggedNodeReadyQueue*)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()

from /usr/local/lib/python3.5/dist-packages/tensorflow/python/../libtensorflow_framework.so

4 0x00007fffaf377fba in Eigen::NonBlockingThreadPoolTempl::WorkerLoop(int) ()

from /usr/local/lib/python3.5/dist-packages/tensorflow/python/../libtensorflow_framework.so

5 0x00007fffaf377062 in std::_Function_handler

from /usr/local/lib/python3.5/dist-packages/tensorflow/python/../libtensorflow_framework.so

6 0x00007ffff253bc80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6

7 0x00007ffff7bc16ba in start_thread (arg=0x7fff779bf700)

at pthread_create.c:333

8 0x00007ffff78f741d in clone ()

at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

(gdb)
`
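
The gdb backtrace above shows only native frames, so it does not say which Python call was running when the crash happened. A minimal sketch, assuming the standard-library `faulthandler` module is acceptable in the training script; the log file name `segfault_trace.log` is a hypothetical choice, not something from the original report:

```python
import faulthandler

# Hypothetical log file name. Keep the file object alive for as long as the
# handler is enabled, because faulthandler writes to its file descriptor.
trace_log = open("segfault_trace.log", "w")

# On SIGSEGV (and SIGFPE, SIGABRT, SIGBUS, SIGILL) dump the Python traceback
# of every thread to the log before the process dies.
faulthandler.enable(file=trace_log, all_threads=True)

# ... build and train the model as usual from here on ...
```

Running the script as `python3 -X faulthandler script.py` enables the same handler without code changes and writes the traceback to stderr.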

anhnt1

👍1

All 17 comments

Sometimes it's a hardware problem. Do you also get segfaults with other libraries (e.g. PyTorch), or only with TF+Keras?

yuyang-huang on 20 Jun 2018

@yuyang-huang - I do not think it is a hardware issue. I used it for quite a while (extensively), without any problems. But I haven't tried other libraries yet.

anhnt1 on 20 Jun 2018

Hi,
Source code and data files attached. I am sure you will get the same segfault. Please help.
Tuan
files.tar.gz

anhnt1 on 20 Jun 2018

You've applied StandardScaler to the input of the Embedding layer. The negative values lead to the SEGFAULT in TF.
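
An Embedding layer looks up rows by index, so its inputs should be integers in `[0, input_dim)`. The following is a minimal sketch of a pre-flight check, not code from the attached files: `check_embedding_indices` and `standardize` are hypothetical helpers, and `standardize` only mimics what StandardScaler does (subtract the mean, divide by the population standard deviation) to show why scaled values fail the check.

```python
import statistics


def check_embedding_indices(indices, input_dim):
    """Raise ValueError unless every value is an integer in [0, input_dim).

    `indices` is a flat iterable; for a NumPy array pass `array.ravel()`.
    """
    for position, value in enumerate(indices):
        if not float(value).is_integer():
            raise ValueError(
                "index %r at position %d is not an integer" % (value, position))
        if not 0 <= value < input_dim:
            raise ValueError(
                "index %r at position %d is outside [0, %d)"
                % (value, position, input_dim))


def standardize(values):
    """Zero-mean, unit-variance scaling (stand-in for StandardScaler)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


token_ids = [0, 3, 7, 2]
check_embedding_indices(token_ids, input_dim=8)  # valid: passes silently

scaled = standardize(token_ids)  # now contains negative, fractional values
try:
    check_embedding_indices(scaled, input_dim=8)
except ValueError as error:
    print("rejected:", error)
```

Running a check like this on the training data before `fit` turns a possible native crash into an ordinary Python exception that names the offending value.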

yuyang-huang on 20 Jun 2018

@yuyang-huang - Thanks, I did not know that. But I only added StandardScaler this morning, and the segfault has been happening for the past 2 weeks. Training runs for a few tens of epochs (20-200) and then dies.

anhnt1 on 20 Jun 2018

I was wrong, then. Seems like negative values are acceptable to tensorflow-gpu. It just doesn't pass on the CPU version.

yuyang-huang on 20 Jun 2018

They are accepted, but there is no reason to standard-scale Embedding inputs, and in my opinion it is still very dangerous behavior.

Maybe try decreasing the batch size and see if that helps?

tRosenflanz on 20 Jun 2018

My GPU has 11 GB of RAM, and my model is (correct me if I am wrong):

parameters: 4,596,739 --> 4,596,739 * 4 * 3, approx. 52M for the model
activations (including the input): 20569 --> 128 * 20569 * 4, approx. 10M

I removed StandardScaler and still get the same result (as many times before). Reducing batch_size to 64 gives the same result. I also checked CUDA (9.0) and cuDNN (7.0.5), and stress-tested the GPU (cudamemtest --stress) for 7 h with no errors.
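
The estimate above can be reproduced as a short calculation. This is a sketch under stated assumptions: 4 bytes per float32 value, and the factor of 3 taken from the post as is (presumably weights, gradients, and one optimizer slot per parameter, which is my reading rather than something the post states).

```python
BYTES_PER_FLOAT32 = 4
MIB = 1024 ** 2

n_params = 4_596_739
copies_per_param = 3          # factor from the post; its meaning is assumed
batch_size = 128
activations_per_sample = 20_569

param_bytes = n_params * BYTES_PER_FLOAT32 * copies_per_param
activation_bytes = batch_size * activations_per_sample * BYTES_PER_FLOAT32


def to_mib(n_bytes):
    """Convert a byte count to mebibytes."""
    return n_bytes / MIB


print("parameters:  %.1f MiB" % to_mib(param_bytes))       # 52.6 MiB
print("activations: %.1f MiB" % to_mib(activation_bytes))  # 10.0 MiB
```

Both figures are tiny next to 11 GB, which supports the poster's point that plain memory exhaustion is an unlikely explanation.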

anhnt1 on 21 Jun 2018

Okay, just checking... Try downgrading to TensorFlow 1.8; maybe the release candidate is not stable (I doubt that will help, though). I have had segfaults caused by BLAS issues before, but I doubt that is the case for you. From the stack trace, it really doesn't look like a Keras error, to be honest.

tRosenflanz on 21 Jun 2018

I observe that keras 2.2.0 + tensorflow 1.8.0 consistently produces "Segmentation fault: 11", while keras 2.1.6 + tensorflow 1.8.0 runs fine. I am running conda version 4.5.5 and conda-build version 3.10.5. My conda environment package lists are below. The first one (with keras 2.1.6) runs OK; the second (keras 2.2.0) throws segmentation faults. This is on macOS 10.13.5. I also observe the seg fault on Ubuntu 18.04 with keras 2.2.0 and tensorflow-gpu 1.8.0, and as with the Mac install, reverting to keras 2.1.6 cured the seg faults.

```
packages in environment at /Users/r/anaconda3/envs/tfa:
Name Version Build Channel
absl-py 0.2.2 py_0 conda-forge
astor 0.6.2 py_0 conda-forge
blas 1.0 mkl
bleach 1.5.0 py36_0 conda-forge
ca-certificates 2018.03.07 0
certifi 2018.4.16 py36_0
gast 0.2.0 py_0 conda-forge
grpcio 1.12.1 py36hd9629dc_0
h5py 2.8.0 py36ha8ecd60_0
hdf5 1.10.2 hfa1e0ec_1
html5lib 0.9999999 py36_0 conda-forge
intel-openmp 2018.0.3 0
keras 2.1.6 py36_0
libcxx 4.0.1 h579ed51_0
libcxxabi 4.0.1 hebd6815_0
libedit 3.1.20170329 hb402a30_2
libffi 3.2.1 h475c297_4
libgfortran 3.0.1 h93005f0_2
libprotobuf 3.5.2 hd28b015_1 conda-forge
markdown 2.6.11 py_0 conda-forge
mkl 2018.0.3 1
mkl_fft 1.0.1 py36h917ab60_0
mkl_random 1.0.1 py36h78cc56f_0
ncurses 6.1 h0a44026_0
numpy 1.14.5 py36h9bb19eb_3
numpy-base 1.14.5 py36ha9ae307_3
openssl 1.0.2o h26aff7b_0
pandas 0.23.1 py36h1702cab_0
pip 10.0.1 py36_0
protobuf 3.5.2 py36_0 conda-forge
psutil 5.4.6 py36h1de35cc_0
python 3.6.6 hc167b69_0
python-dateutil 2.7.3 py36_0
pytz 2018.5 py36_0
pyyaml 3.12 py36h2ba1e63_1
readline 7.0 hc1231fa_4
scikit-learn 0.19.1 py36hffbff8c_0
scipy 1.1.0 py36hcaad992_0
setuptools 39.2.0 py36_0
six 1.11.0 py36h0e22d5e_1
sqlite 3.24.0 ha441bb4_0
tensorboard 1.8.0 py36_1 conda-forge
tensorflow 1.8.0 py36_1 conda-forge
termcolor 1.1.0 py_2 conda-forge
time 1.7 0 conda-forge
tk 8.6.7 h35a86e2_3
webencodings 0.5.1 py36_0 conda-forge
werkzeug 0.14.1 py_0 conda-forge
wheel 0.31.1 py36_0
xz 5.2.4 h1de35cc_4
yaml 0.1.7 hc338f04_2
zlib 1.2.11 hf3cbc9b_2

packages in environment at /Users/r/anaconda3/envs/tfb:

Name Version Build Channel
absl-py 0.2.2 py36_0
astor 0.6.2 py36_0
blas 1.0 mkl
bleach 1.5.0 py36_0
ca-certificates 2018.03.07 0
certifi 2018.4.16 py36_0
gast 0.2.0 py36_0
grpcio 1.12.1 py36hd9629dc_0
h5py 2.8.0 py36ha8ecd60_0
hdf5 1.10.2 hfa1e0ec_1
html5lib 0.9999999 py36_0 conda-forge
intel-openmp 2018.0.3 0
keras 2.2.0 0
keras-applications 1.0.2 py36_0
keras-base 2.2.0 py36_0
keras-preprocessing 1.0.1 py36_0
libcxx 4.0.1 h579ed51_0
libcxxabi 4.0.1 hebd6815_0
libedit 3.1.20170329 hb402a30_2
libffi 3.2.1 h475c297_4
libgfortran 3.0.1 h93005f0_2
libprotobuf 3.5.2 h2cd40f5_0
markdown 2.6.11 py36_0
mkl 2018.0.3 1
mkl_fft 1.0.1 py36h917ab60_0
mkl_random 1.0.1 py36h78cc56f_0
ncurses 6.1 h0a44026_0
numpy 1.14.5 py36h9bb19eb_3
numpy-base 1.14.5 py36ha9ae307_3
openssl 1.0.2o h26aff7b_0
pandas 0.23.1 py36h1702cab_0
pip 10.0.1 py36_0
protobuf 3.5.2 py36h0a44026_0
psutil 5.4.6 py36h1de35cc_0
python 3.6.6 hc167b69_0
python-dateutil 2.7.3 py36_0
pytz 2018.5 py36_0
pyyaml 3.12 py36h2ba1e63_1
readline 7.0 hc1231fa_4
scikit-learn 0.19.1 py36hffbff8c_0
scipy 1.1.0 py36hcaad992_0
setuptools 39.2.0 py36_0
six 1.11.0 py36h0e22d5e_1
sqlite 3.24.0 ha441bb4_0
tensorboard 1.8.0 py36_1 conda-forge
tensorflow 1.8.0 py36_1 conda-forge
termcolor 1.1.0 py36_1
time 1.7 0 conda-forge
tk 8.6.7 h35a86e2_3
webencodings 0.5.1 py36h3b9701d_1
werkzeug 0.14.1 py36_0
wheel 0.31.1 py36_0
xz 5.2.4 h1de35cc_4
yaml 0.1.7 hc338f04_2
zlib 1.2.11 hf3cbc9b_2
```
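
Since the reports in this thread hinge on exact version combinations, a compact version report is easier to compare than a full package list. A minimal sketch using the standard-library `importlib.metadata` (Python 3.8 or newer, so newer than the interpreters used in this thread); `installed_versions` is a hypothetical helper and the default package names are only the ones mentioned in the discussion:

```python
from importlib import metadata  # standard library, Python 3.8+


def installed_versions(packages=("keras", "tensorflow", "tensorflow-gpu", "numpy")):
    """Map each distribution name to its installed version, or None if absent."""
    report = {}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(name, version or "not installed")
```

This reads the metadata recorded at install time, so it reports what is installed in the active environment without importing TensorFlow itself.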

yazabaza on 9 Jul 2018

👍2

I am having a similar issue: I am running Keras training sessions in a loop.
The first one works and the second one throws a "segmentation fault".

This is with Keras 2.2.4 (TensorFlow 1.8, Ubuntu 16.04).
When I downgraded to Keras 2.1.1, the version that was working before, everything worked correctly again.

Apologies, with time constraints I cannot do a lot more debugging now, but I can assist in the coming days if that would help.
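
When sessions are run in a loop, one way to narrow this down is to start each session in a fresh interpreter, so no native state carries over between runs and a crash in one run does not take down the loop. This is a sketch of that isolation idea, not a fix confirmed in this thread; `run_isolated` is a hypothetical helper, and the signal detection relies on POSIX behavior.

```python
import signal
import subprocess
import sys


def run_isolated(args):
    """Run `python <args>` in a fresh interpreter and describe how it ended.

    Returns (ok, description). On POSIX a negative return code means the
    child was killed by the signal with that number.
    """
    proc = subprocess.run([sys.executable] + list(args))
    if proc.returncode == 0:
        return True, "ok"
    if proc.returncode < 0:
        return False, "killed by " + signal.Signals(-proc.returncode).name
    return False, "exit code %d" % proc.returncode
```

For example, `run_isolated(["train.py", "--run", "1"])` would launch one session, where `train.py` and its `--run` flag are hypothetical. If the second session still reports `killed by SIGSEGV` in a fresh process, state left over from the first session is probably not the cause.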

MyHeadInTheClouds on 4 Nov 2018

Any news on this? I'm having Seg faults as well...
But only if I run keras. Using just tensorflow with gpu works fine.

Will try to downgrade keras.

drozzy on 13 Nov 2018

👍3

Just an update: using the Keras Docker container worked for me:
https://github.com/keras-team/keras/tree/master/docker

drozzy on 10 Dec 2018

I tried what @yazabaza posted; I was on Keras 2.2.4 and tensorflow 1.12.0 and was getting the segmentation faults. Downgraded to keras 2.1.6 and Tensorflow 1.8.0 and the errors stopped.

aldarm on 20 Jan 2019

🎉2 👍2

> Downgraded to keras 2.1.6 and Tensorflow 1.8.0 and the errors stopped.

keras 2.1.6 and Tensorflow-GPU 1.8.0 also worked on my machine (GeForce RTX 2080).
Thank you.

peaceiris on 24 Jan 2019

Hey everyone, I am using the code from this blog, and when I migrated it to my NVIDIA Jetson TX2, it did not run smoothly:

nvidia@tegra-ubuntu:~/Documents/AntiNet/liveness-detection-opencv$ sudo python3 liveness_demo.py --model liveness20190329.model --le le20190329.pickle --detector face_detector
Using TensorFlow backend.
2019-05-06 06:29:07.385706: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:864] ARM64 does not support NUMA - returning NUMA node zero
2019-05-06 06:29:07.385929: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1392] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.67GiB freeMemory: 5.85GiB
2019-05-06 06:29:07.386016: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1471] Adding visible gpu devices: 0
2019-05-06 06:29:09.087313: I tensorflow/core/common_runtime/gpu/gpu_device.cc:952] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-06 06:29:09.087450: I tensorflow/core/common_runtime/gpu/gpu_device.cc:958] 0
2019-05-06 06:29:09.087497: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: N
2019-05-06 06:29:09.087809: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5336 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
[INFO] loading face detector...
[INFO] loading liveness detector...
Segmentation fault (core dumped)

Can anybody help me? @yuyang-huang