Hello,
I keep hitting the same segfault. I've just installed gdb and got the backtrace below. My system is a freshly installed Ubuntu 16.04 with keras 2.2, tf 1.9.0rc1, and numpy 1.14.5 (compiled from source). Please help.
Thanks in advance.
`Thread 20 "python3.5" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff779bf700 (LWP 2911)]
0x0000000000000045 in ?? ()
(gdb) bt
from /usr/local/lib/python3.5/dist-packages/tensorflow/python/../libtensorflow_framework.so
from /usr/local/lib/python3.5/dist-packages/tensorflow/python/../libtensorflow_framework.so
from /usr/local/lib/python3.5/dist-packages/tensorflow/python/../libtensorflow_framework.so
from /usr/local/lib/python3.5/dist-packages/tensorflow/python/../libtensorflow_framework.so
from /usr/local/lib/python3.5/dist-packages/tensorflow/python/../libtensorflow_framework.so
at pthread_create.c:333
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
(gdb)
`
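A gdb backtrace like the one above shows only native frames, so it does not say which Python call was running when the crash happened. A minimal sketch of getting that extra information, assuming the standard-library `faulthandler` module (Python 3.3+) can be enabled in the training script; the helper name `enable_crash_traceback` is made up for illustration:

```python
import faulthandler
import sys


def enable_crash_traceback():
    """Dump the Python traceback of every thread when a fatal signal such as SIGSEGV arrives."""
    if not faulthandler.is_enabled():
        try:
            # Writes to sys.stderr by default.
            faulthandler.enable(all_threads=True)
        except (AttributeError, ValueError, OSError):
            # sys.stderr was replaced by an object without a file descriptor;
            # fall back to the interpreter's original stderr.
            faulthandler.enable(file=sys.__stderr__, all_threads=True)
    return faulthandler.is_enabled()


# Call once at the very top of the training script, before importing keras/tensorflow.
enable_crash_traceback()
```

The same handler can be switched on without touching the script by running `python3 -X faulthandler your_script.py` (the script name is a placeholder) or by setting the `PYTHONFAULTHANDLER` environment variable.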
Sometimes it's a hardware problem. Do you also get segfaults with other libraries (e.g. PyTorch), or does it occur only with TF+Keras?
@yuyang-huang - I do not think it is a hardware issue. I have used it extensively for quite a while without any problems. But I haven't tried other libraries yet.
Hi,
Source code and data files attached. I am sure you will get the same segfault. Please help.
Tuan
files.tar.gz
You've applied StandardScaler to the input of the Embedding layer. The negative values lead to the segfault in TF.
@yuyang-huang - Thanks, I did not know this. But I only added StandardScaler this morning, and the segfault has been happening for the past 2 weeks. Training runs for a few tens of epochs (20-200), then dies.
I was wrong, then. Seems like negative values are acceptable to tensorflow-gpu. It just doesn't pass on the CPU version.
They are accepted, but there is no reason to standard-scale Embedding inputs, and it is still very dangerous behavior in my opinion.
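To make the point about Embedding inputs concrete: an Embedding layer uses each input value as a row index into its weight table, so every value should be an integer in `[0, input_dim)`. A minimal sketch of a pre-training check, in plain Python; the function name `check_embedding_indices` and the sample data are hypothetical and not taken from the attached files:

```python
def check_embedding_indices(batch, input_dim):
    """Raise ValueError unless every value is an integer index in [0, input_dim)."""
    for row_number, row in enumerate(batch):
        for value in row:
            # Standard-scaled features show up here as fractional and/or negative numbers.
            if int(value) != value or not 0 <= value < input_dim:
                raise ValueError(
                    "row %d: %r is not a valid Embedding index for input_dim=%d"
                    % (row_number, value, input_dim)
                )
    return True


# Hypothetical data: raw token ids are fine, standard-scaled values are not.
raw_ids = [[0, 3, 7], [2, 2, 9]]
scaled = [[-1.22, 0.0, 1.22], [0.5, -0.5, 1.0]]

check_embedding_indices(raw_ids, input_dim=10)
try:
    check_embedding_indices(scaled, input_dim=10)
except ValueError as err:
    print("rejected:", err)
```

Running a check like this on the array fed to the Embedding layer turns a silent out-of-range lookup into an ordinary Python exception, whichever backend (CPU or GPU) is in use.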
Maybe try decreasing the batch size and see if that helps?
My GPU has 11G RAM, and my model is (correct me if I am wrong):
Okay, just checking... Try downgrading to Tensorflow 1.8 - maybe the release candidate is not stable (I doubt that will help, though). I have had segfaults caused by BLAS issues before, but I doubt that is the case for you. From the stack trace, it really doesn't look like an error in Keras, to be honest.
I observe that keras 2.2.0 + tensorflow 1.8.0 consistently produces "Segmentation fault: 11", while keras 2.1.6 + tensorflow 1.8.0 runs fine. I am running conda version 4.5.5 and conda-build version 3.10.5. My conda environment package lists are below. The first one (with keras 2.1.6) runs OK; the second (keras 2.2.0) throws segmentation faults. This is on macOS 10.13.5. I also observe the segfault on Ubuntu 18.04 with keras 2.2.0 and tensorflow-gpu 1.8.0, and, as with the Mac install, reverting to keras 2.1.6 cured it.
```
packages in environment at /Users/r/anaconda3/envs/tfa:
Name Version Build Channel
absl-py 0.2.2 py_0 conda-forge
astor 0.6.2 py_0 conda-forge
blas 1.0 mkl
bleach 1.5.0 py36_0 conda-forge
ca-certificates 2018.03.07 0
certifi 2018.4.16 py36_0
gast 0.2.0 py_0 conda-forge
grpcio 1.12.1 py36hd9629dc_0
h5py 2.8.0 py36ha8ecd60_0
hdf5 1.10.2 hfa1e0ec_1
html5lib 0.9999999 py36_0 conda-forge
intel-openmp 2018.0.3 0
keras 2.1.6 py36_0
libcxx 4.0.1 h579ed51_0
libcxxabi 4.0.1 hebd6815_0
libedit 3.1.20170329 hb402a30_2
libffi 3.2.1 h475c297_4
libgfortran 3.0.1 h93005f0_2
libprotobuf 3.5.2 hd28b015_1 conda-forge
markdown 2.6.11 py_0 conda-forge
mkl 2018.0.3 1
mkl_fft 1.0.1 py36h917ab60_0
mkl_random 1.0.1 py36h78cc56f_0
ncurses 6.1 h0a44026_0
numpy 1.14.5 py36h9bb19eb_3
numpy-base 1.14.5 py36ha9ae307_3
openssl 1.0.2o h26aff7b_0
pandas 0.23.1 py36h1702cab_0
pip 10.0.1 py36_0
protobuf 3.5.2 py36_0 conda-forge
psutil 5.4.6 py36h1de35cc_0
python 3.6.6 hc167b69_0
python-dateutil 2.7.3 py36_0
pytz 2018.5 py36_0
pyyaml 3.12 py36h2ba1e63_1
readline 7.0 hc1231fa_4
scikit-learn 0.19.1 py36hffbff8c_0
scipy 1.1.0 py36hcaad992_0
setuptools 39.2.0 py36_0
six 1.11.0 py36h0e22d5e_1
sqlite 3.24.0 ha441bb4_0
tensorboard 1.8.0 py36_1 conda-forge
tensorflow 1.8.0 py36_1 conda-forge
termcolor 1.1.0 py_2 conda-forge
time 1.7 0 conda-forge
tk 8.6.7 h35a86e2_3
webencodings 0.5.1 py36_0 conda-forge
werkzeug 0.14.1 py_0 conda-forge
wheel 0.31.1 py36_0
xz 5.2.4 h1de35cc_4
yaml 0.1.7 hc338f04_2
zlib 1.2.11 hf3cbc9b_2

packages in environment at /Users/r/anaconda3/envs/tfb:
Name Version Build Channel
absl-py 0.2.2 py36_0
astor 0.6.2 py36_0
blas 1.0 mkl
bleach 1.5.0 py36_0
ca-certificates 2018.03.07 0
certifi 2018.4.16 py36_0
gast 0.2.0 py36_0
grpcio 1.12.1 py36hd9629dc_0
h5py 2.8.0 py36ha8ecd60_0
hdf5 1.10.2 hfa1e0ec_1
html5lib 0.9999999 py36_0 conda-forge
intel-openmp 2018.0.3 0
keras 2.2.0 0
keras-applications 1.0.2 py36_0
keras-base 2.2.0 py36_0
keras-preprocessing 1.0.1 py36_0
libcxx 4.0.1 h579ed51_0
libcxxabi 4.0.1 hebd6815_0
libedit 3.1.20170329 hb402a30_2
libffi 3.2.1 h475c297_4
libgfortran 3.0.1 h93005f0_2
libprotobuf 3.5.2 h2cd40f5_0
markdown 2.6.11 py36_0
mkl 2018.0.3 1
mkl_fft 1.0.1 py36h917ab60_0
mkl_random 1.0.1 py36h78cc56f_0
ncurses 6.1 h0a44026_0
numpy 1.14.5 py36h9bb19eb_3
numpy-base 1.14.5 py36ha9ae307_3
openssl 1.0.2o h26aff7b_0
pandas 0.23.1 py36h1702cab_0
pip 10.0.1 py36_0
protobuf 3.5.2 py36h0a44026_0
psutil 5.4.6 py36h1de35cc_0
python 3.6.6 hc167b69_0
python-dateutil 2.7.3 py36_0
pytz 2018.5 py36_0
pyyaml 3.12 py36h2ba1e63_1
readline 7.0 hc1231fa_4
scikit-learn 0.19.1 py36hffbff8c_0
scipy 1.1.0 py36hcaad992_0
setuptools 39.2.0 py36_0
six 1.11.0 py36h0e22d5e_1
sqlite 3.24.0 ha441bb4_0
tensorboard 1.8.0 py36_1 conda-forge
tensorflow 1.8.0 py36_1 conda-forge
termcolor 1.1.0 py36_1
time 1.7 0 conda-forge
tk 8.6.7 h35a86e2_3
webencodings 0.5.1 py36h3b9701d_1
werkzeug 0.14.1 py36_0
wheel 0.31.1 py36_0
xz 5.2.4 h1de35cc_4
yaml 0.1.7 hc338f04_2
zlib 1.2.11 hf3cbc9b_2
```
I am having a similar issue: I am running Keras training sessions in a loop.
The first one works and the second one throws a "segmentation fault".
This is with Keras 2.2.4 (Tensorflow 1.8, Ubuntu 16.04).
When I downgraded to the version that was working before (Keras 2.1.1), everything worked correctly again.
Apologies, because of time constraints I cannot do much more debugging now, but I can assist in the coming days if that would help.
Any news on this? I'm having Seg faults as well...
But only if I run keras. Using just tensorflow with gpu works fine.
Will try to downgrade keras.
Just an update, using keras docker container worked for me:
https://github.com/keras-team/keras/tree/master/docker
I tried what @yazabaza posted; I was on Keras 2.2.4 and tensorflow 1.12.0 and was getting the segmentation faults. Downgraded to keras 2.1.6 and Tensorflow 1.8.0 and the errors stopped.
> Downgraded to keras 2.1.6 and Tensorflow 1.8.0 and the errors stopped.

keras 2.1.6 and Tensorflow-GPU 1.8.0 also worked on my machine (GeForce RTX 2080).
Thank you.
Hey everyone, I am using the code from this blog,
and when I migrated it to my NVIDIA Jetson TX2, it did not run properly:
nvidia@tegra-ubuntu:~/Documents/AntiNet/liveness-detection-opencv$ sudo python3 liveness_demo.py --model liveness20190329.model --le le20190329.pickle --detector face_detector
Using TensorFlow backend.
2019-05-06 06:29:07.385706: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:864] ARM64 does not support NUMA - returning NUMA node zero
2019-05-06 06:29:07.385929: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1392] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.67GiB freeMemory: 5.85GiB
2019-05-06 06:29:07.386016: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1471] Adding visible gpu devices: 0
2019-05-06 06:29:09.087313: I tensorflow/core/common_runtime/gpu/gpu_device.cc:952] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-06 06:29:09.087450: I tensorflow/core/common_runtime/gpu/gpu_device.cc:958] 0
2019-05-06 06:29:09.087497: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: N
2019-05-06 06:29:09.087809: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5336 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
[INFO] loading face detector...
[INFO] loading liveness detector...
Segmentation fault (core dumped)
Can anybody help me? @yuyang-huang
Ran into the same problem with keras=2.2.2 and tensorflow-gpu=1.10.0.
Downgrading to keras=2.1.6 and tensorflow-gpu=1.8.0 solved the issue.
Running RTX 2060 on Ubuntu 16.04
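Since several reports in this thread come down to which keras/tensorflow pair is installed, it may help to print the installed versions next to the pair reported as working (keras 2.1.6 + tensorflow 1.8.0) before filing or re-testing. A minimal sketch; the names `REPORTED_WORKING`, `installed_version` and `compare_with_reported` are made up here, and the distribution names are an assumption (a GPU install is published as `tensorflow-gpu`, so the dictionary key would need adjusting in that case):

```python
# Combination that several commenters in this thread reported as segfault-free.
REPORTED_WORKING = {"keras": "2.1.6", "tensorflow": "1.8.0"}


def installed_version(package):
    """Return the installed version string of a distribution, or None if it is absent."""
    try:
        from importlib import metadata  # Python 3.8+
    except ImportError:
        metadata = None
    if metadata is not None:
        try:
            return metadata.version(package)
        except metadata.PackageNotFoundError:
            return None
    # Older interpreters (this thread uses Python 3.5/3.6): fall back to setuptools.
    import pkg_resources
    try:
        return pkg_resources.get_distribution(package).version
    except pkg_resources.DistributionNotFound:
        return None


def compare_with_reported(lookup=installed_version):
    """Map each package to (installed, reported_working, matches)."""
    report = {}
    for package, wanted in REPORTED_WORKING.items():
        found = lookup(package)
        report[package] = (found, wanted, found == wanted)
    return report


for package, (found, wanted, matches) in compare_with_reported().items():
    print("%s: installed=%s reported-working=%s match=%s" % (package, found, wanted, matches))
```

A mismatch here does not prove the version pair is the cause; it only shows whether the environment differs from the one commenters reported as stable.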