TensorFlow version (use command below): 1.14
Have I written custom code: No
Bazel version: N/A
CUDA/cuDNN version: CUDA 10.0 / cuDNN v7.6.3 (as far as I know, only this combination is compatible with the TensorFlow API as of now)
GPU model and memory: Nvidia P3200 and 32GB RAM
Exact command to reproduce:
python train.py --logtostderr --train_dir=training/ --pipeline_config_path=training/faster_rcnn_inception_v2_coco.config
The model trains and then abruptly stops after 30-60 minutes or so every time; I then have to re-run the training script, and it continues from the last step. I had no problems before when running simpler models with a smaller pool of 200-400 training images. This time I copied a full Pascal VOC 2007 dataset with over 10,000 training/test images into the API folders to train on the pretrained faster_rcnn_inception_v2_coco model. I also had to manually re-verify each of the 10K images in ImageLbl, because the Pascal VOC XML files are annotated differently than what ImageLbl would generate by default.
I run train.py, it trains normally for 30-60 minutes, and then it abruptly stops. The last line in the CMD prompt is always at line 2005:
```
c:.....conda\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()
```
Thank you for your post. We noticed you have not filled out the following fields in the issue template. Could you update them if they are relevant in your case, or leave them as N/A? Thanks.
Have I written custom code
Bazel version
CUDA/cuDNN version
GPU model and memory
Updated. Thanks.
I am also facing this issue. Any idea or suggestion to solve it?
Same for me while running albert model
Try adding these lines of code immediately after importing tensorflow in train.py:
```python
from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)
```
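If you are on TensorFlow 2.x, a rough equivalent of `allow_growth` is per-GPU memory growth. This is only a sketch, assuming a standard TF 2.x install; adjust it to your GPU setup:

```python
import tensorflow as tf

# Enable memory growth so TensorFlow allocates GPU memory on demand
# instead of grabbing all of it up front.
gpus = tf.config.experimental.list_physical_devices("GPU")
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
```

On a machine without a GPU the loop simply does nothing, so the snippet is safe to leave in.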
@caohoangphuctd97
It didn't work out, any way for this to be fixed?
Go to common.py and change the model_variant to the one that you used to train your model. In my case it is Xception_65. The default for DeepLab is MobilenetV2.
Hit the same issue after making a change (batch size from 1 to 2 in the Faster R-CNN ResNet-50 architecture); changing the batch size back to 1 made the issue go away. Hopefully it will help others too. :)
It works for me, thanks!
Does anyone know the reason?
Are you guys updating the .config file? Where are you changing the batch size? I'm running TF 1.15.0 with ssd_inception_v2_coco.
@Syirrus Yes, the batch size in the config file, on line 140.
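For reference, the batch size lives in the `train_config` block of the Object Detection API pipeline config. The exact line number varies between config files, but the fragment looks roughly like this:

```
train_config: {
  batch_size: 1
  ...
}
```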
I had the same issue while training the ssd_inception_v2_coco model, and the problem was solved just by restarting my laptop. Hope this helps someone. I am using tensorflow-gpu 1.14.
I got this error because the GPU was not empty; another process was using it at the same time.
Killing the previous process solved this issue for me.
I have the same problem of TensorFlow aborting the training process at step 1400 with `self._traceback = tf_stack.extract_stack()`, even after adding the lines by @caohoangphuctd97. My batch_size is already down at 1, and I don't know how else I could fix it. I use an Nvidia GeForce GTX 1650 with 4GB dedicated memory.
Also, I'm getting this error:
```
(0) Invalid argument: Nan in summary histogram for:
ModelVars/FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance
```
Any suggestions?
In my case, it was because I forgot to run sess.run(tf.compat.v1.global_variables_initializer()), which produced "Attempting to use uninitialized value fc_2/dense/kernel".
Is this the code I have to insert into train.py?
Hi, I am still facing this issue even after adding the lines by @caohoangphuctd97, and I tried different batch sizes, including 1. Can anyone help solve it?
In my case, it was because the Dataset ran out of elements and the last batch did not have the specified number of samples.
I changed to
```python
mydataset.repeat().batch(batch_size, drop_remainder=True)
```
and it worked fine.
Hi @vishal180197,
In the meantime, I accidentally discovered the solution myself.
The answer by @caohoangphuctd97 didn't help me, but I set the batch size up to 2 and then it worked. I first thought the batch size has to be an even number, but the explanation by @DiyuanLu also makes sense here.
Though I can't confirm his code, I would recommend experimenting with the batch size: choose the highest your computer can handle (if there is a memory allocation error, TensorFlow will tell you in advance) and one that fits your dataset.
mydataset.repeat().batch(batch_size, drop_remainder=True)
Hi @DiyuanLu, where do I make this change in my model_main.py? I am pretty new to this, so I have very little idea.
@Totemi1324, I tried various batch sizes, both even and odd, up to the point of a memory allocation error, but so far no success. Any other pointers?
Then I can't help you further; this is what worked for me. Good luck finding a proper solution!
@vishal180197, this applies only if you are really using the tf.data.Dataset API.
E.g., you initialize a dataset with
```python
dataset = tf.data.Dataset.from_tensor_slices(tf.random.uniform([4, 10]))
```
(see https://www.tensorflow.org/guide/data), then
```python
dataset = dataset.shuffle(buffer_size=4).repeat().batch(batch_size, drop_remainder=True)
```
(note that shuffle requires a buffer_size argument). If you are not using a Dataset, then this doesn't apply to you.
Good luck
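To illustrate why `drop_remainder` matters: batching 10 samples into batches of 3 leaves a final partial batch of 1, and a graph built for a fixed batch shape will fail on it. A plain-Python sketch of the two behaviors (no TensorFlow needed; `batch` here is a hypothetical helper mimicking `Dataset.batch`):

```python
def batch(samples, batch_size, drop_remainder=False):
    """Split samples into consecutive batches, optionally dropping the short tail."""
    batches = [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]
    if drop_remainder and batches and len(batches[-1]) < batch_size:
        batches.pop()  # discard the partial final batch
    return batches

samples = list(range(10))
print(batch(samples, 3))                       # last batch has only 1 element
print(batch(samples, 3, drop_remainder=True))  # every batch has exactly 3 elements
```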
Try adding these lines of code immediately after importing tensorflow in train.py:
```python
from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)
```
It works, thanks!!
@AeroWRX
Is this still an issue? Please close this thread if your issue was resolved. Thanks!
It is not resolved for me. I have tried everything suggested above.
@csingh27 Try downgrading your numpy version to 1.16; that's what worked for me.
@vishal180197 Thanks for your reply. Unfortunately this also does not work. Any other suggestions? I have been stuck on this for at least a week now and have basically tried every possible suggestion out there.
Please, any support would be deeply appreciated.
I have the same issue.
TensorFlow = 2.3
When I try to use a placeholder I should use a Session to run, but it doesn't run; if I delete the placeholder it runs successfully.
```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()

v1 = tf.Variable(2)
v2 = tf.Variable(4)
p1 = tf.compat.v1.placeholder(tf.float32)
r1 = tf.add(v1, v2)
print(r1)

s = tf.compat.v1.Session()
print(s.run(r1, feed_dict={p1: 5.5}))
```
What's the solution, @tensorflowbutler?
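As suggested earlier in the thread, the snippet above fails because the variables are never initialized before the fetch; the unused placeholder can also simply be dropped. A minimal corrected sketch, assuming TF 2.x with v1 compatibility mode:

```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()

v1 = tf.Variable(2)
v2 = tf.Variable(4)
r1 = tf.add(v1, v2)

s = tf.compat.v1.Session()
# Initialize v1 and v2 before evaluating any op that reads them,
# otherwise you get "Attempting to use uninitialized value".
s.run(tf.compat.v1.global_variables_initializer())
result = s.run(r1)
print(result)
```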
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you.
Closing as stale. Please reopen if you'd like to work on this further.