TensorFlow version (use command below): 1.14
Have I written custom code: No
Bazel version: N/A
CUDA/cuDNN version: CUDA 10.0 / cuDNN v7.6.3 (as far as I know, only this combination is compatible with the TensorFlow API as of now)
GPU model and memory: Nvidia P3200 and 32GB RAM
Exact command to reproduce:
python train.py --logtostderr --train_dir=training/ --pipeline_config_path=training/faster_rcnn_inception_v2_coco.config
The model trains and then abruptly stops after 30-60 minutes or so every time; I then have to re-run the training script, and it continues from the last step. I had no problems before when running simpler models with a smaller pool of 200-400 training images. This time I copied a full Pascal VOC 2007 dataset with over 10,000 training/test images into the API folders to train on the pretrained faster_rcnn_inception_v2_coco model. I also had to manually re-verify each of the 10K images in ImageLbl, because the Pascal VOC XML files are annotated differently than what ImageLbl would generate by default.
I run train.py, it trains normally for 30-60 minutes, and then it abruptly stops. The last line in the CMD prompt is always at line 2005:
```
c:.....conda\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()
```
Thank you for your post. We noticed you have not filled out the following fields in the issue template. Could you update them if they are relevant in your case, or leave them as N/A? Thanks.
Have I written custom code
Bazel version
CUDA/cuDNN version
GPU model and memory
Updated. Thanks.
I am also facing this issue. Any idea or suggestion to solve it?
Same for me while running albert model
Try adding these lines of code immediately after importing tensorflow in train.py:
```python
from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)
```
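If you are on TensorFlow 2.x, a rough equivalent of `allow_growth` is per-GPU memory growth. This is only a sketch, assuming a standard TF 2.x install; adjust it to your GPU setup:

```python
import tensorflow as tf

# Enable memory growth so TensorFlow allocates GPU memory on demand
# instead of grabbing all of it up front.
gpus = tf.config.experimental.list_physical_devices("GPU")
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
```

On a machine without a GPU the loop simply does nothing, so the snippet is safe to leave in.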
@caohoangphuctd97
It didn't work out, any way for this to be fixed?
Go to common.py and change the model_variant to the one that you used to train your model. In my case it is Xception_65. The default for DeepLab is MobilenetV2.
Hit the same issue after making a change (batch size from 1 to 2 in the Faster R-CNN ResNet-50 architecture); changing the batch size back to 1 made the issue go away. Hopefully it will help others too. :)
It works for me, thanks!
Does anyone know the reason?
Are you guys updating the .config file? Where are you changing the batch size? I'm running TF 1.15.0 with ssd_inception_v2_coco.
@Syirrus Yes, the batch size in the config file, on line 140.
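For reference, the batch size lives in the `train_config` block of the Object Detection API pipeline config. The exact line number varies between config files, but the fragment looks roughly like this:

```
train_config: {
  batch_size: 1
  ...
}
```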
I had the same issue while training the ssd_inception_v2_coco model, and the problem was solved just by restarting my laptop. Hope this helps someone. I am using tensorflow-gpu 1.14.
I got this error because the GPU was not empty; another process was using it at the same time.
Killing the previous process solved this issue for me.
I have the same problem of TensorFlow aborting the training process at step 1400 with `self._traceback = tf_stack.extract_stack()`, even after adding the lines by @caohoangphuctd97. My batch_size is already down at 1, and I don't know how else I could fix it. I use an Nvidia GeForce GTX 1650 with 4GB dedicated memory.
Also, I'm getting this error:
```
(0) Invalid argument: Nan in summary histogram for:
ModelVars/FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance
```
Any suggestions?
In my case, it was because I forgot to run sess.run(tf.compat.v1.global_variables_initializer()), which produced "Attempting to use uninitialized value fc_2/dense/kernel".
Is this the code I have to insert into train.py?
Hi, I am still facing this issue even after adding the lines by @caohoangphuctd97, and I tried different batch sizes, including 1. Can anyone help solve it?
In my case, it was because the Dataset ran out of elements and the last batch did not have the specified number of samples.
I changed to
```python
mydataset.repeat().batch(batch_size, drop_remainder=True)
```
and it worked fine.
Hi @vishal180197,
In the meantime, I accidentally discovered the solution myself.
The answer by @caohoangphuctd97 didn't help me, but I set the batch size up to 2 and then it worked. I first thought the batch size has to be an even number, but the explanation by @DiyuanLu also makes sense here.
Though I can't confirm his code, I would recommend experimenting with the batch size: choose the highest your computer can handle (if there is a memory allocation error, TensorFlow will tell you in advance) and one that fits your dataset.
mydataset.repeat().batch(batch_size, drop_remainder=True)
Hi @DiyuanLu, where do I make this change in my model_main.py? I am pretty new to this, so I have very little idea.
@Totemi1324, I tried various batch sizes, both even and odd, up to the point of a memory allocation error, but so far no success. Any other pointers?
Then I can't help you further; this is what worked for me. Good luck finding a proper solution!
@vishal180197, this applies only if you are really using the tf.data.Dataset API.
E.g., you initialize a dataset with
```python
dataset = tf.data.Dataset.from_tensor_slices(tf.random.uniform([4, 10]))
```
(see https://www.tensorflow.org/guide/data), then
```python
dataset = dataset.shuffle(buffer_size=4).repeat().batch(batch_size, drop_remainder=True)
```
(note that shuffle requires a buffer_size argument). If you are not using a Dataset, then this doesn't apply to you.
Good luck
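To illustrate why `drop_remainder` matters: batching 10 samples into batches of 3 leaves a final partial batch of 1, and a graph built for a fixed batch shape will fail on it. A plain-Python sketch of the two behaviors (no TensorFlow needed; `batch` here is a hypothetical helper mimicking `Dataset.batch`):

```python
def batch(samples, batch_size, drop_remainder=False):
    """Split samples into consecutive batches, optionally dropping the short tail."""
    batches = [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]
    if drop_remainder and batches and len(batches[-1]) < batch_size:
        batches.pop()  # discard the partial final batch
    return batches

samples = list(range(10))
print(batch(samples, 3))                       # last batch has only 1 element
print(batch(samples, 3, drop_remainder=True))  # every batch has exactly 3 elements
```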
Try adding these lines of code immediately after importing tensorflow in train.py:
```python
from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)
```
It works, thanks!!
@AeroWRX
Is this still an issue? Please close this thread if your issue was resolved. Thanks!
It is not resolved for me. I have tried everything suggested above.
@csingh27 Try downgrading your numpy version to 1.16; that's what worked for me.
@vishal180197 Thanks for your reply. Unfortunately this also does not work. Any other suggestions? I have been stuck on this for at least a week now and have basically tried every possible suggestion out there.
Please, any support would be deeply appreciated.
I have the same issue.
TensorFlow = 2.3
When I try to use a placeholder I should use a Session to run, but it doesn't run; if I delete the placeholder it runs successfully.
```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()

v1 = tf.Variable(2)
v2 = tf.Variable(4)
p1 = tf.compat.v1.placeholder(tf.float32)
r1 = tf.add(v1, v2)
print(r1)

s = tf.compat.v1.Session()
print(s.run(r1, feed_dict={p1: 5.5}))
```
What's the solution, @tensorflowbutler?
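As suggested earlier in the thread, the snippet above fails because the variables are never initialized before the fetch; the unused placeholder can also simply be dropped. A minimal corrected sketch, assuming TF 2.x with v1 compatibility mode:

```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()

v1 = tf.Variable(2)
v2 = tf.Variable(4)
r1 = tf.add(v1, v2)

s = tf.compat.v1.Session()
# Initialize v1 and v2 before evaluating any op that reads them,
# otherwise you get "Attempting to use uninitialized value".
s.run(tf.compat.v1.global_variables_initializer())
result = s.run(r1)
print(result)
```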
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you.
Closing as stale. Please reopen if you'd like to work on this further.