System information
What is the top-level directory of the model you are using: Object Detection
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
TensorFlow version: 1.4
CUDA/cuDNN version: cuda 8.0
GPU model and memory: GeForce GTX 1080 Ti x4
Describe the problem
Trained on my own dataset with a nearly identical pipeline config (image size changed to 800x800 and first_stage_max_proposals to 200); in particular, the train batch_size is 1. The dataset is about 1/10 the size of COCO (around 20,000 images) and trains well with many other popular models. Training is initialized with Google's pretrained COCO model weights.
Got the following error: "ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape [64,672,9,9]".
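For reference, the changed settings correspond to fields like these in a Faster R-CNN pipeline config (a sketch only; the actual file may use keep_aspect_ratio_resizer instead of fixed_shape_resizer, and all other fields are omitted):
model {
  faster_rcnn {
    image_resizer {
      fixed_shape_resizer {
        height: 800   # image size changed to 800x800
        width: 800
      }
    }
    first_stage_max_proposals: 200
  }
}
train_config {
  batch_size: 1
}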
Since we don't have access to your dataset, can you reproduce this with a public data set (maybe coco)?
Thanks. Will try the Pascal VOC dataset, which is similar in size to my dataset, to see if the same issue exists.
I have the same issue. Is there a solution?
Did you encounter the issue on a standard dataset, e.g. Pascal VOC or COCO? If so, please describe the details (dataset, environment, etc.); that will help the Google team fix it.
@jwnsu: [...] in particular, train's batch_size is 1 [...]
How did you set this?
I had a similar problem and succeeded on two Tesla K40c GPUs by adding
model {
  ...
  second_stage_batch_size: 4
  ...
}
to the model's configuration file.
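For context, second_stage_batch_size sits under faster_rcnn in the pipeline proto; a minimal sketch, with all other fields elided:
model {
  faster_rcnn {
    ...
    # proposals sampled per image for the second-stage box classifier;
    # lowering this value reduces activation memory during training
    second_stage_batch_size: 4
    ...
  }
}
Note that a smaller second_stage_batch_size trades memory for fewer proposals trained per step, so accuracy and convergence speed may be affected.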
Thanks, will try this setting (it was set to 50 in my config earlier).
After setting second_stage_batch_size: 4, training now proceeds. However, it runs on only 1 GPU even though CUDA_VISIBLE_DEVICES is set to 4 GPUs. All 4 GPUs show memory allocated, but only 1 shows training activity. Is there any particular setting to get all GPUs training (didn't find such info in the README)? nvidia-smi output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   1  GeForce GTX 108...  Off  | 00000000:05:00.0 Off |                  N/A |
| 40%   69C    P2   222W / 250W |  10797MiB / 11172MiB |     64%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:08:00.0 Off |                  N/A |
| 23%   30C    P8     8W / 250W |  10625MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce GTX 108...  Off  | 00000000:83:00.0 Off |                  N/A |
| 23%   21C    P8     8W / 250W |  10625MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce GTX 108...  Off  | 00000000:84:00.0 Off |                  N/A |
| 23%   24C    P8     9W / 250W |  10625MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    1     22802      C   python                                   10785MiB   |
|    2     22802      C   python                                   10613MiB   |
|    4     22802      C   python                                   10613MiB   |
|    5     22802      C   python                                   10613MiB   |
+-----------------------------------------------------------------------------+
You need to pass the "--num_clones 4" option to train.py in order to enable training on 4 GPUs.
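For example, something along these lines (a sketch; the paths are placeholders, and --ps_tasks=1 is the companion flag commonly paired with --num_clones for single-machine multi-GPU runs, so verify it against your version of train.py):
CUDA_VISIBLE_DEVICES=1,2,4,5 python object_detection/train.py \
    --pipeline_config_path=path/to/pipeline.config \
    --train_dir=path/to/train_dir \
    --num_clones=4 \
    --ps_tasks=1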
Closing as this is resolved