System information
What is the top-level directory of the model you are using: Object Detection
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
TensorFlow version: 1.4
CUDA/cuDNN version: cuda 8.0
GPU model and memory: GeForce GTX 1080 Ti x4
Describe the problem
Trained on my own dataset with a nearly identical pipeline config (image size changed to 800x800 and first_stage_max_proposals to 200); in particular, the train batch_size is 1. The dataset is about 1/10 the size of COCO (around 20,000 images) and trains well with many other popular models. Training is initialized with Google's pretrained COCO model weights.
Got the following error: "ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape [64,672,9,9]".
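For reference, the changed settings correspond to fields like these in a Faster R-CNN pipeline config (a sketch only; the actual file may use keep_aspect_ratio_resizer instead of fixed_shape_resizer, and all other fields are omitted):
model {
  faster_rcnn {
    image_resizer {
      fixed_shape_resizer {
        height: 800   # image size changed to 800x800
        width: 800
      }
    }
    first_stage_max_proposals: 200
  }
}
train_config {
  batch_size: 1
}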
Since we don't have access to your dataset, can you reproduce this with a public data set (maybe coco)?
Thanks. Will try the Pascal VOC dataset, which is similar in size to my dataset, to see if the same issue exists.
I have the same issue. Is there a solution?
Did you encounter the issue on a standard dataset, e.g. Pascal VOC or COCO? If so, please describe the details (dataset, environment, etc.); that will help the Google team fix it.
@jwnsu: [...] in particular, train's batch_size is 1 [...]
How did you set this?
I had a similar problem and succeeded on two Tesla K40c GPUs by adding
model {
  ...
  second_stage_batch_size: 4
  ...
}
to the model's configuration file.
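For context, second_stage_batch_size sits under faster_rcnn in the pipeline proto; a minimal sketch, with all other fields elided:
model {
  faster_rcnn {
    ...
    # proposals sampled per image for the second-stage box classifier;
    # lowering this value reduces activation memory during training
    second_stage_batch_size: 4
    ...
  }
}
Note that a smaller second_stage_batch_size trades memory for fewer proposals trained per step, so accuracy and convergence speed may be affected.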
Thanks, will try this setting (it was set to 50 in my config earlier).
After setting second_stage_batch_size: 4, training now proceeds. However, it runs on only 1 GPU even though CUDA_VISIBLE_DEVICES is set to 4 GPUs. All 4 GPUs show memory allocated, but only 1 shows training activity. Is there any particular setting to get all GPUs training (didn't find such info in the README)? nvidia-smi output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   1  GeForce GTX 108...  Off  | 00000000:05:00.0 Off |                  N/A |
| 40%   69C    P2   222W / 250W |  10797MiB / 11172MiB |     64%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:08:00.0 Off |                  N/A |
| 23%   30C    P8     8W / 250W |  10625MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce GTX 108...  Off  | 00000000:83:00.0 Off |                  N/A |
| 23%   21C    P8     8W / 250W |  10625MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce GTX 108...  Off  | 00000000:84:00.0 Off |                  N/A |
| 23%   24C    P8     9W / 250W |  10625MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    1     22802      C   python                                   10785MiB   |
|    2     22802      C   python                                   10613MiB   |
|    4     22802      C   python                                   10613MiB   |
|    5     22802      C   python                                   10613MiB   |
+-----------------------------------------------------------------------------+
You need to pass the "--num_clones 4" option to train.py in order to enable training on 4 GPUs.
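For example, something along these lines (a sketch; the paths are placeholders, and --ps_tasks=1 is the companion flag commonly paired with --num_clones for single-machine multi-GPU runs, so verify it against your version of train.py):
CUDA_VISIBLE_DEVICES=1,2,4,5 python object_detection/train.py \
    --pipeline_config_path=path/to/pipeline.config \
    --train_dir=path/to/train_dir \
    --num_clones=4 \
    --ps_tasks=1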
Closing as this is resolved