Hello, I'm trying to train EfficientDet on my custom dataset. No matter which EfficientDet backbone I use (D0, D1, D2, or D3), Google Colab kicks me out with an error like this:
W0823 08:47:59.941824 140512077502208 optimizer_v2.py:1275] Gradients do not exist for variables ['top_bn/gamma:0', 'top_bn/beta:0'] when minimizing the loss.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py:574: calling map_fn_v2 (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Use fn_output_signature instead
W0823 08:48:01.099456 140512077502208 deprecation.py:506] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py:574: calling map_fn_v2 (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Use fn_output_signature instead
WARNING:tensorflow:Gradients do not exist for variables ['top_bn/gamma:0', 'top_bn/beta:0'] when minimizing the loss.
W0823 08:48:24.028264 140512077502208 optimizer_v2.py:1275] Gradients do not exist for variables ['top_bn/gamma:0', 'top_bn/beta:0'] when minimizing the loss.
tcmalloc: large alloc 1610612736 bytes == 0x28d7cc000 @ 0x7fcbe4652b6b 0x7fcbe4672379 0x7fcbc8c90207 0x7fcbba1bec4f 0x7fcbba24748b 0x7fcbba0b9d06 0x7fcbba0babcc 0x7fcbba0bae53 0x7fcbc4ed9189 0x7fcbba33c57f 0x7fcbba331605 0x7fcbba3ef591 0x7fcbba3ec2a3 0x7fcbba3db3e5 0x7fcbe40346db 0x7fcbe436da3f
^C
I've tried batch sizes of 64 and 128, with no luck. All the images in my dataset are 896x896. I don't know where I'm making a mistake.
Can anybody please help?
In my training, an input shape of 896 needed at least 12 GB of VRAM at a batch size of 8 (in Colab)!
Try even smaller batch sizes. In my experience, models can learn even with a batch size of 2 or 4.
Training might take a bit longer, but depending on your dataset a couple of hours is usually enough.
On the other hand, if training time is a problem, do you really need a shape of 896?
Maybe experiment with an input shape of 768, 640, or even 512?
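For reference, if you are training through the TF2 Object Detection API with a pipeline config, both knobs can be changed with a few lines of Python. This is only a rough sketch: the file name pipeline.config, the keep_aspect_ratio_resizer field and the concrete values are assumptions about a stock ssd_efficientdet_* setup, so adapt them to whatever your config actually contains.

# Rough sketch, assuming the TF2 Object Detection API and a stock
# EfficientDet pipeline config; the file name and resizer field are assumptions.
from object_detection.utils import config_util

configs = config_util.get_configs_from_pipeline_file('pipeline.config')

# Smaller batch: 2-8 is typically what a single Colab GPU can handle here.
configs['train_config'].batch_size = 4

# Smaller input shape: the stock ssd_efficientdet_* configs use a
# keep_aspect_ratio_resizer; if yours does too, shrink both bounds.
resizer = configs['model'].ssd.image_resizer.keep_aspect_ratio_resizer
resizer.min_dimension = 640
resizer.max_dimension = 640

# Write the modified config back out (overwrites ./pipeline.config).
pipeline_proto = config_util.create_pipeline_proto_from_configs(configs)
config_util.save_pipeline_config(pipeline_proto, '.')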
Thank you for your suggestions,
I've tried EfficientDet-D2 with batch size 8 and input size 768x768. I'm on a Tesla T4:
Sun Aug 23 10:23:11 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   54C    P8    10W /  70W |      0MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
And I got this error (it's actually rather long, so I'm only copying the last lines):
0 successful operations.
0 derived errors ignored. [Op:__inference__dist_train_step_89434]
Errors may have originated from an input operation.
Input Source operations connected to node EfficientDet-D2/functional_1/stack_5/block_4/se_excite/mul:
EfficientDet-D2/functional_1/stack_5/block_4/depthwise_activation/mul (defined at /local/lib/python3.6/dist-packages/official/modeling/activations/swish.py:42)
Function call stack:
_dist_train_step -> _dist_train_step
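If the part of the trace you trimmed ends in a ResourceExhaustedError (the usual culprit for this kind of abort on Colab), then D2 at 768x768 with batch size 8 still doesn't fit in the T4's ~15 GB. Before the next attempt it can help to confirm what TensorFlow actually sees and to switch to on-demand GPU memory allocation, so nvidia-smi shows the real footprint instead of one big upfront grab. A small sketch using standard TF 2.x calls, nothing specific to this repo; it has to run before anything else touches the GPU:

# Confirm the GPU is visible and enable incremental memory allocation.
# This does not make an oversized model fit, it only makes the actual
# memory footprint visible while you tune batch size / input shape.
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
print(gpus)

for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)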
My config file is attached. Maybe I'm doing something wrong here.
cfg.txt
Reducing the batch size to 4 in Google Colab worked for me 👍
I had the same problem. Decreasing the image dimensions and the batch size in the config file resolved the issue. I think this happens because not enough memory is available during training.
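If you want to double-check that memory really is the bottleneck, one option is to poll nvidia-smi from a separate Colab cell while a few training steps run. A hypothetical monitoring snippet; the nvidia-smi query flags are standard, everything else is just a plain subprocess loop:

# Print used vs. total GPU memory every few seconds while training runs
# in another process / cell.
import subprocess
import time

for _ in range(10):  # sample for roughly a minute
    out = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=memory.used,memory.total',
         '--format=csv,noheader']).decode()
    print(out.strip())
    time.sleep(5)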