Models: Object Detection Training Stops With "^C" Need Help

Created on 23 Aug 2020 · 5 comments · Source: tensorflow/models

Hello, I'm trying to train EfficientDet on my custom dataset. Regardless of which EfficientDet backbone I use (D0, D1, D2, or D3), Google Colab just kicks me out with an error like this:

W0823 08:47:59.941824 140512077502208 optimizer_v2.py:1275] Gradients do not exist for variables ['top_bn/gamma:0', 'top_bn/beta:0'] when minimizing the loss.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py:574: calling map_fn_v2 (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Use fn_output_signature instead
W0823 08:48:01.099456 140512077502208 deprecation.py:506] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py:574: calling map_fn_v2 (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Use fn_output_signature instead
WARNING:tensorflow:Gradients do not exist for variables ['top_bn/gamma:0', 'top_bn/beta:0'] when minimizing the loss.
W0823 08:48:24.028264 140512077502208 optimizer_v2.py:1275] Gradients do not exist for variables ['top_bn/gamma:0', 'top_bn/beta:0'] when minimizing the loss.
tcmalloc: large alloc 1610612736 bytes == 0x28d7cc000 @ 0x7fcbe4652b6b 0x7fcbe4672379 0x7fcbc8c90207 0x7fcbba1bec4f 0x7fcbba24748b 0x7fcbba0b9d06 0x7fcbba0babcc 0x7fcbba0bae53 0x7fcbc4ed9189 0x7fcbba33c57f 0x7fcbba331605 0x7fcbba3ef591 0x7fcbba3ec2a3 0x7fcbba3db3e5 0x7fcbe40346db 0x7fcbe436da3f
^C

I've tried batch sizes of 64 and 128, no luck. All the images in my dataset are 896x896. I don't know where I'm making a mistake.

Can anybody please help?
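A quick sanity check like the one below can show whether the runtime is exhausting host RAM rather than GPU memory; a tcmalloc "large alloc" message followed by ^C usually means Colab killed the process. This is only a sketch and assumes a standard Colab runtime where psutil is preinstalled:

import psutil          # typically preinstalled on Colab
import tensorflow as tf

# Host RAM: if "available" is close to zero right before the crash,
# the Colab VM is likely killing the process for running out of system memory.
vm = psutil.virtual_memory()
print("Host RAM total/available (GB): %.1f / %.1f"
      % (vm.total / 1e9, vm.available / 1e9))

# GPU: confirm a GPU is attached and visible to TensorFlow.
print("GPUs visible to TF:", tf.config.list_physical_devices("GPU"))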

research support

Most helpful comment

Reducing the batch size to 4 in Google Colab worked for me 👍

All 5 comments

In my training, an input shape of 896 needed at least 12 GB of VRAM at a batch size of 8 (in Colab)!
Try even smaller batch sizes; in my experience, models can learn even with a batch size of 2 or 4.
Training might take a bit longer, but depending on your dataset a couple of hours is usually enough.
On the other hand, if training time is a problem, do you really need a shape of 896?
Maybe experiment with an input shape of 768, 640, or even 512?
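If it helps, here is a minimal sketch of how the batch size could be dropped programmatically rather than editing the file by hand. It assumes the TF2 Object Detection API's config_util helpers, and the paths below are placeholders for your own pipeline config and model directory:

from object_detection.utils import config_util

# Placeholder paths -- point these at your own pipeline config / model dir.
PIPELINE_CONFIG = "/content/training/pipeline.config"
MODEL_DIR = "/content/training"

configs = config_util.get_configs_from_pipeline_file(PIPELINE_CONFIG)

# A per-replica batch size of 2-4 usually fits on a single Colab GPU.
configs["train_config"].batch_size = 4

# Write the edited config back out (saved as MODEL_DIR/pipeline.config).
pipeline_proto = config_util.create_pipeline_proto_from_configs(configs)
config_util.save_pipeline_config(pipeline_proto, MODEL_DIR)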

Thank you for your suggestions,
I've tried EfficientDet-D2 with a batch size of 8 and an input size of 768x768. I'm on a Tesla T4:

Sun Aug 23 10:23:11 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   54C    P8    10W /  70W |      0MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

And I got this error (it's actually rather long; I'm just copying the last lines):

0 successful operations.
0 derived errors ignored. [Op:__inference__dist_train_step_89434]

Errors may have originated from an input operation.
Input Source operations connected to node EfficientDet-D2/functional_1/stack_5/block_4/se_excite/mul:
 EfficientDet-D2/functional_1/stack_5/block_4/depthwise_activation/mul (defined at /local/lib/python3.6/dist-packages/official/modeling/activations/swish.py:42)

Input Source operations connected to node EfficientDet-D2/functional_1/stack_5/block_4/se_excite/mul:
 EfficientDet-D2/functional_1/stack_5/block_4/depthwise_activation/mul (defined at /local/lib/python3.6/dist-packages/official/modeling/activations/swish.py:42)

Function call stack:
_dist_train_step -> _dist_train_step


My config file is attached. Maybe I'm doing something wrong here.
cfg.txt

Reducing the batch size to 4 in Google Colab worked for me 👍

I had the same problem; decreasing the image dimensions and batch size in the config file resolved the issue. I think this is because not enough RAM is available during training.
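For example, lowering both the input size and the batch size looks roughly like this. This is just a sketch: the exact field path depends on which resizer your config uses (the EfficientDet configs shipped with the TF2 Object Detection API are SSD-based and use a keep_aspect_ratio_resizer, but check your own cfg.txt), and the path is a placeholder:

from object_detection.utils import config_util

# Placeholder path -- substitute your own pipeline config.
configs = config_util.get_configs_from_pipeline_file("/content/training/pipeline.config")

# Shrink the input resolution (e.g. 896 -> 512) and the batch size together;
# both cut the per-step memory footprint.
resizer = configs["model"].ssd.image_resizer.keep_aspect_ratio_resizer
resizer.min_dimension = 512
resizer.max_dimension = 512
configs["train_config"].batch_size = 4

config_util.save_pipeline_config(
    config_util.create_pipeline_proto_from_configs(configs), "/content/training")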
