Models: Help! Segmentation fault (core dumped)? TensorFlow version problem? Driver problem?

Created on 11 May 2019 · 11 Comments · Source: tensorflow/models

~/MyFiles/tensorflow/models/research$ python deeplab/train.py \
--logtostderr \
--training_number_of_steps=30000 \
--train_split="train" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--train_crop_size="513,513" \
--train_batch_size=1 \
--dataset="pascal_voc_seg" \
--tf_initial_checkpoint=${PATH_TO_INITIAL_CHECKPOINT} \
--train_logdir=${PATH_TO_TRAIN_DIR} \
--dataset_dir=${PATH_TO_DATASET}

INFO:tensorflow:Training on train set
WARNING:tensorflow:From deeplab/train.py:418: Print (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2018-08-20.
Instructions for updating:
Use tf.print instead of tf.Print. Note that tf.print returns a no-output operator that directly prints the output. Outside of defuns or eager mode, this operator will not be executed unless it is directly specified in session.run or used as a control dependency for other operators. This is only a concern in graph mode. Below is an example of how to ensure tf.print executes in graph mode:
    sess = tf.Session()
    with sess.as_default():
        tensor = tf.range(10)
        print_op = tf.print(tensor)
        with tf.control_dependencies([print_op]):
            out = tf.add(tensor, tensor)
        sess.run(out)
Additionally, to use tf.print in python 2.7, users must make sure to import
the following:

from __future__ import print_function

INFO:tensorflow:Graph was finalized.
2019-05-11 22:32:00.265339: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-05-11 22:32:00.434722: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: TITAN X (Pascal) major: 6 minor: 1 memoryClockRate(GHz): 1.531
pciBusID: 0000:04:00.0
totalMemory: 11.91GiB freeMemory: 11.77GiB
2019-05-11 22:32:00.434782: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-05-11 22:32:00.919006: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-11 22:32:00.919075: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-05-11 22:32:00.919087: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-05-11 22:32:00.919707: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11385 MB memory) -> physical GPU (device: 0, name: TITAN X (Pascal), pci bus id: 0000:04:00.0, compute capability: 6.1)
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Segmentation fault (core dumped)

Most helpful comment

I got this error when --dataset_dir=${PATH_TO_DATASET} didn't point to the TFRecord directory.
I corrected the path and it works.
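
A minimal sketch of a pre-flight check for this case, to run before launching training. The helper name `check_dataset_dir` and the default `*.tfrecord` file pattern are assumptions for illustration, not part of deeplab/train.py; adjust the pattern to whatever your conversion script actually wrote.

```python
import fnmatch
import os


def check_dataset_dir(dataset_dir, pattern="*.tfrecord"):
    """Return the sorted shard file names in dataset_dir, or raise ValueError.

    Hypothetical helper: the pattern is an assumed naming convention.
    """
    if not os.path.isdir(dataset_dir):
        raise ValueError("dataset_dir is not a directory: %r" % dataset_dir)
    # Only look at files directly inside the directory, not in subfolders.
    shards = sorted(fnmatch.filter(os.listdir(dataset_dir), pattern))
    if not shards:
        raise ValueError(
            "no files matching %r found in %r" % (pattern, dataset_dir))
    return shards
```

Calling `check_dataset_dir(os.environ["PATH_TO_DATASET"])` first should turn a wrong path into a readable Python error rather than a crash inside the input pipeline.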

All 11 comments

Thank you for your post. We noticed you have not filled out the following fields in the issue template. Could you update them if they are relevant in your case, or leave them as N/A? Thanks.
What is the top-level directory of the model you are using
Have I written custom code
OS Platform and Distribution
TensorFlow installed from
TensorFlow version
Bazel version
CUDA/cuDNN version
GPU model and memory
Exact command to reproduce

Solved by using Python 2.7 with the newest TensorFlow version.

I'm having the same problem but only when training with train_aug set. Can someone help?

Using python2.7 with current tensorflow doesn't help for me.

slim$ python train_image_classifier.py
2019-08-12 15:49:30.085787: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-08-12 15:49:33.824887: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:02:00.0
totalMemory: 10.73GiB freeMemory: 10.53GiB
2019-08-12 15:49:34.080394: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 1 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:03:00.0
totalMemory: 10.73GiB freeMemory: 10.53GiB
2019-08-12 15:49:34.320103: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 2 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:82:00.0
totalMemory: 10.73GiB freeMemory: 10.53GiB
2019-08-12 15:49:34.558022: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 3 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:83:00.0
totalMemory: 10.73GiB freeMemory: 10.53GiB
2019-08-12 15:49:34.558463: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0, 1, 2, 3
2019-08-12 15:49:36.278881: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-08-12 15:49:36.278941: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0 1 2 3
2019-08-12 15:49:36.278950: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N N N N
2019-08-12 15:49:36.278955: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1: N N N N
2019-08-12 15:49:36.278960: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 2: N N N N
2019-08-12 15:49:36.278965: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 3: N N N N
2019-08-12 15:49:36.280058: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10170 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:02:00.0, compute capability: 7.5)
2019-08-12 15:49:36.405281: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10170 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:03:00.0, compute capability: 7.5)
2019-08-12 15:49:36.559362: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10170 MB memory) -> physical GPU (device: 2, name: GeForce RTX 2080 Ti, pci bus id: 0000:82:00.0, compute capability: 7.5)
2019-08-12 15:49:36.718446: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10170 MB memory) -> physical GPU (device: 3, name: GeForce RTX 2080 Ti, pci bus id: 0000:83:00.0, compute capability: 7.5)
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:02:00.0, compute capability: 7.5
/job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:03:00.0, compute capability: 7.5
/job:localhost/replica:0/task:0/device:GPU:2 -> device: 2, name: GeForce RTX 2080 Ti, pci bus id: 0000:82:00.0, compute capability: 7.5
/job:localhost/replica:0/task:0/device:GPU:3 -> device: 3, name: GeForce RTX 2080 Ti, pci bus id: 0000:83:00.0, compute capability: 7.5
2019-08-12 15:49:36.859223: I tensorflow/core/common_runtime/direct_session.cc:288] Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:02:00.0, compute capability: 7.5
/job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:03:00.0, compute capability: 7.5
/job:localhost/replica:0/task:0/device:GPU:2 -> device: 2, name: GeForce RTX 2080 Ti, pci bus id: 0000:82:00.0, compute capability: 7.5
/job:localhost/replica:0/task:0/device:GPU:3 -> device: 3, name: GeForce RTX 2080 Ti, pci bus id: 0000:83:00.0, compute capability: 7.5

WARNING:tensorflow:From train_image_classifier.py:413: create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.create_global_step
WARNING:tensorflow:From /home/fsy/slim/nets/resnet_v1.py:244: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
WARNING:tensorflow:From train_image_classifier.py:481: softmax_cross_entropy (from tensorflow.contrib.losses.python.losses.loss_ops) is deprecated and will be removed after 2016-12-30.
Instructions for updating:
Use tf.losses.softmax_cross_entropy instead. Note that the order of the logits and labels arguments has been changed.
WARNING:tensorflow:From /home/fsy/anaconda3/envs/cbl/lib/python3.5/site-packages/tensorflow/contrib/losses/python/losses/loss_ops.py:398: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See @{tf.nn.softmax_cross_entropy_with_logits_v2}.

WARNING:tensorflow:From /home/fsy/anaconda3/envs/cbl/lib/python3.5/site-packages/tensorflow/contrib/losses/python/losses/loss_ops.py:399: compute_weighted_loss (from tensorflow.contrib.losses.python.losses.loss_ops) is deprecated and will be removed after 2016-12-30.
Instructions for updating:
Use tf.losses.compute_weighted_loss instead.
WARNING:tensorflow:From /home/fsy/anaconda3/envs/cbl/lib/python3.5/site-packages/tensorflow/contrib/losses/python/losses/loss_ops.py:147: add_arg_scope..func_with_args (from tensorflow.contrib.losses.python.losses.loss_ops) is deprecated and will be removed after 2016-12-30.
Instructions for updating:
Use tf.losses.add_loss instead.
WARNING:tensorflow:From /home/fsy/anaconda3/envs/cbl/lib/python3.5/site-packages/tensorflow/contrib/slim/python/slim/learning.py:737: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-08-12 15:49:49.013991: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0, 1, 2, 3
2019-08-12 15:49:49.014759: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-08-12 15:49:49.014784: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0 1 2 3
2019-08-12 15:49:49.014797: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N N N N
2019-08-12 15:49:49.014806: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1: N N N N
2019-08-12 15:49:49.014817: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 2: N N N N
2019-08-12 15:49:49.014842: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 3: N N N N
2019-08-12 15:49:49.015676: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10170 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:02:00.0, compute capability: 7.5)
2019-08-12 15:49:49.017773: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10170 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:03:00.0, compute capability: 7.5)
2019-08-12 15:49:49.017951: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10170 MB memory) -> physical GPU (device: 2, name: GeForce RTX 2080 Ti, pci bus id: 0000:82:00.0, compute capability: 7.5)
2019-08-12 15:49:49.018109: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10170 MB memory) -> physical GPU (device: 3, name: GeForce RTX 2080 Ti, pci bus id: 0000:83:00.0, compute capability: 7.5)
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path save_models/1/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
Segmentation fault (core dumped)

Running models/research/slim/ on a single 2080 GPU worked, but the Segmentation fault (core dumped) error happened on four GPUs.

@npbackdraft have you solved this yet? I also tried Python 2.7, but it didn't work.

See @tzahiw's solution above. That was my problem. The crash itself, of course, gives no indication of the actual cause.
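
Since the bare "Segmentation fault (core dumped)" line says nothing about where the crash happened, one generic aid (not something proposed in this thread) is Python 3's standard `faulthandler` module. The sketch below assumes Python 3.3+ and would go at the very top of the training script. It only prints the Python-level stack at the moment of the crash and will not name the wrong path by itself.

```python
import faulthandler
import sys

# Ask the interpreter to dump the Python traceback of every thread to stderr
# when it receives a fatal signal such as SIGSEGV.
faulthandler.enable(file=sys.stderr, all_threads=True)
```

The same effect is available without editing the script by running it with `python -X faulthandler`.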

Thanks, solved it. I had typed an extra space in the dataset path.
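
A stray space like this is hard to spot in a shell command. As a small sketch, the hypothetical helper below (not part of any TensorFlow script) flags leading or trailing whitespace in a path string. Printing the value with `repr()` also makes such characters visible.

```python
def has_stray_whitespace(path):
    """Return True if the path begins or ends with whitespace."""
    return path != path.strip()


def describe_path(path):
    """Return a message showing the exact string, with quotes, for inspection."""
    return "dataset_dir = %r" % path
```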

I had the same error. I just had to change the dataset dir path. Earlier it pointed to the dataset directory; pointing it to the generated TFRecords instead seemed to solve the error.
