A higher mini_batch generally gives higher accuracy (mAP / Top1 / Top5).
Training on the GPU while keeping tensors in CPU-RAM allows you to increase the mini-batch size 4x-16x and more.
You can train with a 16x larger mini_batch, at about 5x lower speed on Yolov3-spp; it should give roughly +2-4 mAP.
Use in your cfg-file:
[net]
batch=64
subdivisions=2
width=416
height=416
optimized_memory=3
workspace_size_limit_MB=1000
random=1 is not supported.
Tested on the model https://github.com/AlexeyAB/darknet/blob/master/cfg/yolov3-spp.cfg with width=416 height=416 on 8 GB GPU-VRAM + 32 GB CPU-RAM:
./darknet detector train data/obj.data yolov3-spp.cfg -map
default: mini_batch=8 = batch_64 / subdivisions_8, GPU-RAM-usage=6.5 GB, iteration = 3 sec
optimized_memory=1: mini_batch=8 = batch_64 / subdivisions_8, GPU-RAM-usage=5.8 GB, iteration = 3 sec
optimized_memory=2 workspace_size_limit_MB=1000: mini_batch=20 = batch_60 / subdivisions_3, GPU-RAM-usage=5.4 GB, iteration = 15 sec
optimized_memory=3 workspace_size_limit_MB=1000: mini_batch=32 = batch_64 / subdivisions_2, GPU-RAM-usage=4.0 GB, iteration = 15 sec (CPU-RAM-usage = 31 GB)
Not well tested yet:
optimized_memory=3 workspace_size_limit_MB=2000: mini_batch=64 = batch_128 / subdivisions_2, GPU-RAM-usage=7.5 GB, iteration = 15 sec (CPU-RAM-usage = 62 GB)
optimized_memory=3 workspace_size_limit_MB=2000 or 4000: mini_batch=128 = batch_256 / subdivisions_2, GPU-RAM-usage=13.5 GB, iteration = 15 sec (CPU-RAM-usage = 124 GB)
mini_batch=24 - 24 GB VRAM RTX Titan - $2500: https://www.amazon.com/NVIDIA-Titan-RTX-Graphics-Card/dp/B07L8YGDL5
mini_batch=48 - 48 GB VRAM Quadro RTX 8000 - $5500: https://www.amazon.com/PNY-VCQRTX8000-PB-NVIDIA-Quadro-Graphic/dp/B07NH3HKG9/
mini_batch=128 - 128 GB CPU-RAM - ~$1700 = RTX 2080 Ti 11 GB ($1100) + 128 GB CPU-RAM (4x32 GB, $600), with this software solution
mini_batch=512 - 512 GB CPU-RAM - ~$9200 = Quadro RTX 8000 48 GB VRAM ($5500) + 512 GB CPU-RAM (2 x (8 x 32 GB), $2600) + AMD EPYC 7401P CPU ($1100, 32 cores, 16 memory slots, up to 2 TB RAM and 128 PCIe 3.0 lanes), with this software solution
mini_batch=512 - 512 GB VRAM (16 x 32 GB Tesla V100) DGX-2 - $400,000 https://www.nvidia.com/en-us/data-center/dgx-2/ with a synchronized batch-normalization technique like: https://arxiv.org/abs/1711.07240v4
Example of trained model: yolov3-tiny_pan_gaus_giou_scale.cfg.txt
| mini_batch=32 (+5 mAP@0.5) | mini_batch=8 |
|---|---|
| (chart) | (chart) |
Do you think switching to this higher mini_batch after having already trained the usual way will give added value as well?
@HagegeR I didn't test it well. So just try.
In general - yes.
You can try to train the first several % of iterations with a large mini_batch,
then continue training with a small mini_batch for fast training,
and then train the last few percent of iterations with a large mini_batch again.
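For example, a rough three-stage sequence could look like this (the cfg names are illustrative: copies of the same cfg with different batch/subdivisions, where the big-batch copy also enables optimized_memory=3 and workspace_size_limit_MB):
# stage 1: first few % of iterations with a large mini_batch (CPU-RAM optimization on)
./darknet detector train data/obj.data yolov3-spp_bigbatch.cfg -map
# stage 2: most of the training with a small mini_batch for speed, resuming from the saved weights
./darknet detector train data/obj.data yolov3-spp.cfg backup/yolov3-spp_bigbatch_last.weights -map
# stage 3: last few % of iterations with a large mini_batch again
./darknet detector train data/obj.data yolov3-spp_bigbatch.cfg backup/yolov3-spp_last.weights -map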
Please could you explain in more detail the meaning of the options or how to work out a good configuration? I'm trying to get this feature going with my custom gaussian cfg but I'm not having success so far.
What does this mean?
optimized_memory=3
workspace_size_limit_MB=1000
@LukeAI
The optimized_memory= parameter controls GPU-memory optimization:
optimized_memory=0 - no additional memory optimization (default)
optimized_memory=1 - optimizes delta_gpu: instead of many arrays it allocates two shared arrays, global_delta_gpu & state_delta_gpu, which are reused by most layers. It doesn't slow down training, but it can work incorrectly on new models that are added later.
optimized_memory=2 - additionally uses CPU-RAM instead of GPU-VRAM for the arrays output_gpu (layer output), activation_input_gpu (activation input) and x_gpu (batch-normalization input) in each layer
optimized_memory=3 - additionally uses CPU-RAM instead of GPU-VRAM for the global_delta_gpu & state_delta_gpu arrays
workspace_size_limit_MB=1000 - limits the cuDNN workspace to 1000 MB.
For the Yolov3-spp 416x416 model on an 8GB GPU with 32GB CPU-RAM, try to use: https://github.com/AlexeyAB/darknet/blob/master/cfg/yolov3-spp.cfg
[net]
batch=64
subdivisions=2
width=416
height=416
optimized_memory=3
workspace_size_limit_MB=1000
I'm trying to get this feature going with my custom gaussian cfg but I'm not having success so far.
What problem did you encounter?
What GPU do you use?
How much CPU-RAM do you have?
Rename your cfg-file to a .txt file and attach it.
Such accuracy:
MobileNetv3 - Top1 75.37%
MixNet-S - Top1 75.68%
EfficientNetB0 - Top1 76.3%
can be achieved only if you train with very large mini_batch size (~1024):
either you use a TPU-cluster (~$1M) or a DGX-2 ($400K) with synchronized batch-normalization (which slows down training) https://arxiv.org/abs/1711.07240v4
or you use CPU-RAM instead of GPU-RAM, which is ~100x cheaper but slows down training even more (except on IBM Power8 CPUs with NVLink between CPU & GPUs) https://github.com/AlexeyAB/darknet/issues/4386
With a small mini_batch size (~32), instead of Top1 76.3% we get: https://github.com/AlexeyAB/darknet/issues/3380#issuecomment-501263052
@AlexeyAB
I tried mixnet_m_gpu.cfg with the following settings:
optimized_memory=2
workspace_size_limit_MB=1000
I always get the following error:
243 conv 600 1 x 1/ 1 1 x 1 x1200 -> 1 x 1 x 600 0.001 BF
244 conv 1200 1 x 1/ 1 1 x 1 x 600 -> 1 x 1 x1200 0.001 BF
245 scale Layer: 241
246 conv 200/ 2 1 x 1/ 1 9 x 3 x1200 -> 9 x 3 x 200 0.006 BF
247 Shortcut Layer: 231
248 conv 1536 1 x 1/ 1 9 x 3 x 200 -> 9 x 3 x1536 0.017 BF
249 avg 9 x 3 x1536 -> 1536
250 dropout p = 0.25 1536 -> 1536
CUDA status Error: file: ./src/dark_cuda.c : () : line: 423 : build time: Dec 3 2019 - 23:02:36
CUDA Error: invalid argument
CUDA Error: invalid argument: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.
Could you help to find out the cause?
@erikguo I fixed it: https://github.com/AlexeyAB/darknet/commit/5d0352f961f4dc3db8ccad0570481c69305c0143
Just tried mixnet_m_gpu.cfg with
[net]
# Training
batch=120
subdivisions=2
optimized_memory=3
workspace_size_limit_MB=1000
Thank you very much!
I will try now.
By the way, I found the 'decay' value (0.00005) in mixnet_m_gpu.cfg is different from the other cfgs (decay=0.0005):
momentum=0.9
decay=0.00005
Is it a special setting for mixnet_m_gpu.cfg, or just a typo?
@AlexeyAB
Still getting an error, as follows:
Pinned block_id = 3, filled = 99.917603 %
241 route 240 238 236 234 -> 9 x 3 x1200
242 avg 9 x 3 x1200 -> 1200
243 conv 600 1 x 1/ 1 1 x 1 x1200 -> 1 x 1 x 600 0.001 BF
244 conv 1200 1 x 1/ 1 1 x 1 x 600 -> 1 x 1 x1200 0.001 BF
245 scale Layer: 241
246 conv 200/ 2 1 x 1/ 1 9 x 3 x1200 -> 9 x 3 x 200 0.006 BF
247 Shortcut Layer: 231
248 conv 1536 1 x 1/ 1 9 x 3 x 200 -> 9 x 3 x1536 0.017 BF
249 avg 9 x 3 x1536 -> 1536
250 dropout p = 0.25 1536 -> 1536
251 conv 51 1 x 1/ 1 1 x 1 x1536 -> 1 x 1 x 51 0.000 BF
252 softmax 51
Pinned block_id = 4, filled = 98.600769 %
Total BFLOPS 0.592
Allocate additional workspace_size = 18.58 MB
Loading weights from backup_all/mixnet_m_gpu_last.weights...
seen 64
Done! Loaded 253 layers from weights-file
Learning Rate: 0.064, Momentum: 0.9, Decay: 0.0005
304734
Loaded: 0.933879 seconds
CUDA status Error: file: ./src/blas_kernels.cu : () : line: 668 : build time: Dec 3 2019 - 23:02:38
CUDA Error: an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.
@erikguo Do you get this error if you disable memory optimization?
Comment these lines:
#optimized_memory=3
#workspace_size_limit_MB=1000
By the way, I found the 'decay' value (0.00005) is different from the other cfgs (decay=0.0005)
Since MixNet is a continuation of EfficientNet (which is a continuation of MobileNet, ...), and EfficientNet uses decay=0.00001: https://arxiv.org/pdf/1905.11946v2.pdf
weight decay 1e-5;
After commenting out these lines, the training runs very well. With these lines enabled, it runs well only occasionally and crashes in most cases.
@AlexeyAB
@erikguo
@AlexeyAB
It crashes at the first iteration.
The crash message is as follows:
Pinned block_id = 3, filled = 99.917603 %
241 route 240 238 236 234 -> 9 x 3 x1200
242 avg 9 x 3 x1200 -> 1200
243 conv 600 1 x 1/ 1 1 x 1 x1200 -> 1 x 1 x 600 0.001 BF
244 conv 1200 1 x 1/ 1 1 x 1 x 600 -> 1 x 1 x1200 0.001 BF
245 scale Layer: 241
246 conv 200/ 2 1 x 1/ 1 9 x 3 x1200 -> 9 x 3 x 200 0.006 BF
247 Shortcut Layer: 231
248 conv 1536 1 x 1/ 1 9 x 3 x 200 -> 9 x 3 x1536 0.017 BF
249 avg 9 x 3 x1536 -> 1536
250 dropout p = 0.25 1536 -> 1536
251 conv 51 1 x 1/ 1 1 x 1 x1536 -> 1 x 1 x 51 0.000 BF
252 softmax 51
Pinned block_id = 4, filled = 98.600769 %
Total BFLOPS 0.592
Allocate additional workspace_size = 18.58 MB
Loading weights from backup_all/mixnet_m_gpu_last.weights...
seen 64
Done! Loaded 253 layers from weights-file
Learning Rate: 0.064, Momentum: 0.9, Decay: 5e-05
304734
Loaded: 1.104122 seconds
CUDA status Error: file: ./src/blas_kernels.cu : () : line: 668 : build time: Dec 3 2019 - 23:02:38
CUDA Error: an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.
Aborted (core dumped)
My server has 128 GB of memory and 4 x 1080 Ti 11 GB GPUs.
Darknet is compiled with GPU=1 CUDNN=1 OPENCV=1 CUDNN_HALF=0
@erikguo
I just trained 2600 iterations successfully on an RTX 2070 with a Core i7 CPU and 32 GB of CPU-RAM, using this command:
darknet.exe classifier train cfg/imagenet1k_c.data cfg/mixnet_m_gpu.cfg backup/mixnet_m_gpu_last.weights -topk
and this cfg-file: mixnet_m_gpu.cfg.txt
I use only one GPU for training.
The command is:
darknet classifier train dengdi.data mixnet_m_gpu.cfg backup/mixnet_m_gpu_last.cfg -dont_show
batch and subdivisions are:
batch=128
subdivisions=2
@AlexeyAB
@erikguo
height=96 width=288 ?
darknet.exe classifier train cfg/imagenet1k_c.data cfg/mixnet_m_gpu_mem.cfg backup/mixnet_m_gpu_last.weights -topk

@AlexeyAB
I have tried the following combinations:
batch=128
subdivisions=2
runs very well now
batch=256
subdivisions=2
runs very well now
batch=256
subdivisions=1
crashed in the first iteration
batch=512
subdivisions=2
crashed in the first iteration
Because my images' aspect ratio is about 1:3 (h:w), I set a rectangular network size.
@erikguo
Check this combination:
batch=128
subdivisions=1
batch=256
subdivisions=1
crashed in the first iteration
My OS is Ubuntu 16.04.
This combination crashed two times and ran well once; its execution is not stable:
batch=128
subdivisions=1
This combination is bad; it always crashes:
batch=256
subdivisions=1
@erikguo Try to use workspace_size_limit_MB=8000
batch=256
subdivisions=1
optimized_memory=3
workspace_size_limit_MB=8000
The error messages are different:
One is: CUDA status Error: file: ./src/blas_kernels.cu : () : line: 576
The other is: CUDA status Error: file: ./src/dropout_layer_kernels.cu : () : line: 33
The following setting crashed too. Same error as above.
batch=256
subdivisions=1
optimized_memory=3
workspace_size_limit_MB=8000
Error message:
245 scale Layer: 241
246 conv 200/ 2 1 x 1/ 1 9 x 3 x1200 -> 9 x 3 x 200 0.006 BF
247 Shortcut Layer: 231
248 conv 1536 1 x 1/ 1 9 x 3 x 200 -> 9 x 3 x1536 0.017 BF
249 avg 9 x 3 x1536 -> 1536
250 dropout p = 0.25 1536 -> 1536
251 conv 51 1 x 1/ 1 1 x 1 x1536 -> 1 x 1 x 51 0.000 BF
252 softmax 51
Try to allocate new pinned memory, size = 972 MB
Pinned block_id = 14, filled = 96.900558 %
Try to allocate new pinned BLOCK, size = 81 MB
Pinned block_id = 15, filled = 95.586395 %
Try to allocate new pinned BLOCK, size = 50 MB
Pinned block_id = 16, filled = 99.300003 %
Try to allocate new pinned BLOCK, size = 12 MB
Pinned block_id = 17, filled = 99.920654 %
Try to allocate new pinned BLOCK, size = 7 MB
Total BFLOPS 0.592
Allocate additional workspace_size = 160.59 MB
Loading weights from backup_all/mixnet_m_gpu_last.weights...
seen 64
Done! Loaded 253 layers from weights-file
Learning Rate: 0.064, Momentum: 0.9, Decay: 5e-05
304734
Loaded: 1.654202 seconds
CUDA status Error: file: ./src/dropout_layer_kernels.cu : () : line: 33 : build time: Dec 3 2019 - 23:02:38
CUDA Error: an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.
@erikguo
Just to localize the problem, try to comment out these 2 lines temporarily and recompile:
Then try
batch=256
subdivisions=1
optimized_memory=3
workspace_size_limit_MB=8000
After recompiling, the error changed to the following:
Learning Rate: 0.064, Momentum: 0.9, Decay: 5e-05
304734
Loaded: 1.968368 seconds
CUDA status Error: file: ./src/blas_kernels.cu : () : line: 668 : build time: Dec 4 2019 - 23:47:36
CUDA Error: an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.
@AlexeyAB
@erikguo Also try to comment this line and recompile: https://github.com/AlexeyAB/darknet/blob/efc5478a23a3a3c66d6feefc6d6b485f13503bde/src/network_kernels.cu#L119
@AlexeyAB
After recompiling, I ran it two times with the same command and the same cfg.
The first error:
CUDA status Error: file: ./src/blas_kernels.cu : () : line: 576 : build time: Dec 5 2019 - 00:02:32
CUDA Error: an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered: File exists
The second error:
CUDA status Error: file: ./src/blas_kernels.cu : () : line: 668 : build time: Dec 5 2019 - 00:02:32
CUDA Error: an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered: File exists
@erikguo OK, thanks, I will try to find the bug.
Just to be sure, did you also comment out both of these lines?
@AlexeyAB ,
After commenting them out and recompiling, the error changed to:
CUDA status Error: file: ./src/blas_kernels.cu : () : line: 564 : build time: Dec 5 2019 - 00:14:54
CUDA Error: an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered: File exists
Now I have commented lines in three files:
darknet/src/blas_kernels.cu
darknet/src/network_kernels.cu
darknet/src/dropout_layer_kernels.cu
@erikguo Thanks. Can you compile with DEBUG=1 in the Makefile and run training again? https://github.com/AlexeyAB/darknet/blob/efc5478a23a3a3c66d6feefc6d6b485f13503bde/Makefile#L14
@AlexeyAB ,
Errors:
cuDNN status = cudaDeviceSynchronize() Error in: file: ./src/convolutional_kernels.cu : () : line: 823 : build time: Dec 5 2019 - 00:41:31
cuDNN Error: CUDNN_UNKNOWN_STATUS
cuDNN Error: CUDNN_UNKNOWN_STATUS: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.
@erikguo Thanks.
Also, do you get this issue if you remove the [dropout] layer from the end of your cfg-file?
@AlexeyAB
I have commented out the [dropout] layer at the end of the cfg.
It's not stable now: it crashed the first and third times and ran well the second time. I got the same error:
CUDA status = cudaDeviceSynchronize() Error: file: ./src/blas_kernels.cu : () : line: 564 : build time: Dec 5 2019 - 00:41:31
CUDA Error: an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.
It seems that it may crash in the middle of the first iteration, because I have to wait about 15 s before it crashes.
@erikguo
It ran well the second time.
When it starts up well, will it crash later? Or will it work well until the end?
@AlexeyAB ,
When I said it runs well, I mean it can run more than 10 iterations without crashing. I just press Ctrl-C to interrupt it and run it another time.
@erikguo
When I said it runs well, I mean it can run more than 10 iterations without crashing.
So undo all these changes
Compile with DEBUG=0
Set
batch=256
subdivisions=1
optimized_memory=3
workspace_size_limit_MB=8000
And try to run it several times. When it starts up well, let it keep working; will it crash later?
OK, I will try it and report back later.
@AlexeyAB
I undid all the comments made last night and recompiled.
If I leave the [dropout] layer uncommented, it always crashes immediately after loading the cfg and weights files.
So I commented out the [dropout] layer in the cfg.
I ran it several times. It crashed randomly in the first iteration. However, once it finishes the first iteration, it runs well and never crashes. But the loss goes to nan after several iterations, even if I lower the learning rate.
The following are the running logs:
Loading weights from backup_all/mixnet_m_gpu_last.weights...
seen 64
Done! Loaded 252 layers from weights-file
Learning Rate: 0.016, Momentum: 0.9, Decay: 5e-05
382473
Loaded: 1.856072 seconds
71457, 47.828: 0.002255, 0.002255 avg, 0.011005 rate, 41.015572 seconds, 18292992 images
Loaded: 0.000049 seconds
71458, 47.829: 0.006014, 0.002631 avg, 0.011005 rate, 42.169849 seconds, 18293248 images
Loaded: 0.000042 seconds
71459, 47.830: 4.707831, 0.473151 avg, 0.011005 rate, 40.191479 seconds, 18293504 images
Loaded: 0.000060 seconds
71460, 47.830: nan, nan avg, 0.011005 rate, 39.108238 seconds, 18293760 images
Loaded: 0.000100 seconds
71461, 47.831: nan, nan avg, 0.011005 rate, 39.716503 seconds, 18294016 images
Loaded: 0.000058 seconds
71462, 47.832: nan, nan avg, 0.011004 rate, 39.298252 seconds, 18294272 images
Loaded: 0.000081 seconds
71463, 47.832: nan, nan avg, 0.011004 rate, 39.715801 seconds, 18294528 images
Loaded: 0.000074 seconds
71464, 47.833: nan, nan avg, 0.011004 rate, 39.716663 seconds, 18294784 images
Loaded: 0.000061 seconds
71465, 47.834: nan, nan avg, 0.011004 rate, 39.147743 seconds, 18295040 images
Loaded: 0.000084 seconds
71466, 47.834: nan, nan avg, 0.011004 rate, 39.735199 seconds, 18295296 images
Loaded: 0.000074 seconds
71467, 47.835: nan, nan avg, 0.011004 rate, 40.027672 seconds, 18295552 images
Loaded: 0.000072 seconds
71468, 47.836: nan, nan avg, 0.011004 rate, 39.932713 seconds, 18295808 images
Loaded: 0.000073 seconds
71469, 47.836: nan, nan avg, 0.011004 rate, 39.481960 seconds, 18296064 images
Loaded: 0.000114 seconds
71470, 47.837: nan, nan avg, 0.011004 rate, 40.012989 seconds, 18296320 images
Loaded: 0.000082 seconds
71471, 47.838: nan, nan avg, 0.011004 rate, 39.614643 seconds, 18296576 images
Loaded: 0.000069 seconds
71472, 47.838: nan, nan avg, 0.011004 rate, 39.501343 seconds, 18296832 images
Loaded: 0.000077 seconds
71473, 47.839: nan, nan avg, 0.011004 rate, 39.760441 seconds, 18297088 images
Loaded: 0.000063 seconds
71474, 47.840: nan, nan avg, 0.011004 rate, 39.416786 seconds, 18297344 images
Loaded: 0.000070 seconds
71475, 47.840: nan, nan avg, 0.011004 rate, 39.673023 seconds, 18297600 images
Loaded: 0.000075 seconds
71476, 47.841: nan, nan avg, 0.011004 rate, 39.329891 seconds, 18297856 images
Loaded: 0.000077 seconds
71477, 47.842: nan, nan avg, 0.011004 rate, 40.461945 seconds, 18298112 images
Loaded: 0.000072 seconds
71478, 47.842: nan, nan avg, 0.011003 rate, 39.966011 seconds, 18298368 images
Loaded: 0.000063 seconds
71479, 47.843: nan, nan avg, 0.011003 rate, 39.231728 seconds, 18298624 images
Loaded: 0.000070 seconds
71480, 47.844: nan, nan avg, 0.011003 rate, 39.738995 seconds, 18298880 images
Loaded: 0.000096 seconds
71481, 47.844: nan, nan avg, 0.011003 rate, 40.647068 seconds, 18299136 images
Loaded: 0.000089 seconds
71482, 47.845: nan, nan avg, 0.011003 rate, 41.785786 seconds, 18299392 images
Loaded: 0.000087 seconds
71483, 47.846: nan, nan avg, 0.011003 rate, 40.824448 seconds, 18299648 images
Loaded: 0.000105 seconds
71484, 47.846: nan, nan avg, 0.011003 rate, 40.963627 seconds, 18299904 images
Loaded: 0.000076 seconds
71485, 47.847: nan, nan avg, 0.011003 rate, 40.498711 seconds, 18300160 images
Loaded: 0.000076 seconds
71486, 47.848: nan, nan avg, 0.011003 rate, 39.802647 seconds, 18300416 images
Loaded: 0.000075 seconds
71487, 47.848: nan, nan avg, 0.011003 rate, 40.423454 seconds, 18300672 images
Loaded: 0.000061 seconds
71488, 47.849: nan, nan avg, 0.011003 rate, 39.450256 seconds, 18300928 images
Loaded: 0.000083 seconds
71489, 47.850: nan, nan avg, 0.011003 rate, 40.406216 seconds, 18301184 images
Loaded: 0.000068 seconds
71490, 47.850: nan, nan avg, 0.011003 rate, 39.633228 seconds, 18301440 images
Loaded: 0.000076 seconds
71491, 47.851: nan, nan avg, 0.011003 rate, 39.777164 seconds, 18301696 images
Loaded: 0.000073 seconds
These are the errors once it crashed:
CUDA status Error: file: ./src/dropout_layer_kernels.cu : () : line: 33 : build time: Dec 5 2019 - 21:46:20
CUDA Error: an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.
I also tested enet-b0-nog.cfg (I removed all groups in the [convolutional] layers) with the following:
batch=256
subdivisions=1
optimized_memory=3
workspace_size_limit_MB=8000
Even though I commented out [dropout], it always crashed with this error:
CUDA status Error: file: ./src/dropout_layer_kernels.cu : () : line: 33 : build time: Dec 5 2019 - 21:46:20
CUDA Error: an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.
Even though I commented out [dropout], it always crashed with this error:
CUDA status Error: file: ./src/dropout_layer_kernels.cu : () : line: 33 : build time: Dec 5 2019 - 21:46:20
CUDA Error: an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.
This is very strange; how can it crash in a DropOut layer if you commented out all the DropOut layers?
Just so you know, there are many DropOut layers in EfficientNet.
@AlexeyAB ,
You are right, I forgot to comment out all the [dropout] layers. After commenting them all out, the error messages are:
# running the nvidia-smi command shows new messages when it crashed:
GPU 00000000:03:00.0: Detected Critical Xid Error
GPU 00000000:03:00.0: Detected Critical Xid Error
# crash errors:
CUDA status Error: file: ./src/dark_cuda.c : () : line: 446 : build time: Dec 5 2019 - 21:46:18
CUDA Error: an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.
For enet-b0-nog.cfg, I have to set batch=96, subdivisions=1; then it runs stably with CPU-RAM optimized_memory=3.
I also found a pattern: once it crashes, it is very hard to rerun it successfully. You have to wait some time and then try again; maybe it will run well that time. It looks like some memory is not being released, or something remains in memory, and after waiting a while the OS cleans it up automatically.
@AlexeyAB
@erikguo
I noticed that if it crashes, especially with out-of-memory (CPU/GPU memory), the GPU hardware device can be lost, so you should wait or reboot the PC.
For enet-b0-nog.cfg, I have to set batch=96, subdivisions=1; then it runs stably with CPU-RAM optimized_memory=3.
What CPU-RAM and GPU-VRAM usage do you get?
There is a lot of free CPU memory and GPU memory. After the crash I can immediately run other training very well, but I cannot run the CPU-MEM training.
In Windows, when it crashes, the GPU card is lost. In Ubuntu it isn't; I have used both Windows and Ubuntu before.
@AlexeyAB
Based on this behavior, I guess some memory allocation is 'random': when the allocation is right there is no crash, otherwise it crashes.
@erikguo
Pinned CPU-RAM should be allocated as sequential physical blocks of 1 GB each, so if you have 128 GB of CPU-RAM and you run 128 applications, each of which consumes 1 byte in each of the 128 GB, then the pinned memory cannot be allocated at all.
F.e. if you run 64 applications, each of which consumes 1 byte in each of 64 GB, then only 64 GB of pinned memory can be allocated.
Maybe this is the reason for this behavior:
Based on this behavior, I guess some memory allocation is 'random': when the allocation is right there is no crash, otherwise it crashes.
So it is strongly recommended to reboot the system before running Darknet with GPU-processing + CPU-RAM, and not to load any other applications.
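For context, the Pinned blocks mentioned in the logs are page-locked host memory mapped into the GPU address space. A minimal sketch of the underlying CUDA runtime calls (illustrative only, not the exact darknet code):
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    const size_t block_size = 1024UL * 1024UL * 1024UL;  // one 1 GB pinned block
    float *host_ptr = NULL, *dev_ptr = NULL;

    cudaSetDeviceFlags(cudaDeviceMapHost);  // allow mapping pinned host memory into the device address space

    // Allocate page-locked (pinned) host memory that the GPU can map
    cudaError_t st = cudaHostAlloc((void**)&host_ptr, block_size,
                                   cudaHostAllocMapped | cudaHostAllocPortable);
    if (st != cudaSuccess) { printf("cudaHostAlloc: %s\n", cudaGetErrorString(st)); return 1; }

    // Get the device-side pointer that kernels use to access this CPU-RAM over PCIe
    st = cudaHostGetDevicePointer((void**)&dev_ptr, host_ptr, 0);
    if (st != cudaSuccess) { printf("cudaHostGetDevicePointer: %s\n", cudaGetErrorString(st)); return 1; }

    // ... kernels read/write dev_ptr instead of GPU-VRAM ...

    cudaFreeHost(host_ptr);
    return 0;
}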
why does it need to be sequential?
@HagegeR
Oh yes, the Pinned CPU-memory blocks (GPU-Direct 1.0) do not have to be completely sequential.
I confused this with GPU-Direct 3.0 (RDMA), where the GPU uses the CPU-memory of a remote computer through InfiniBand; in this case, the mapped memory should be a physically sequential block:
GPU -> PCIe -> Computer_1(PCIeController) -> Infiniband -> Computer_2(PCIeController) -> CPU_RAM
See the left scheme in the image below:

@AlexeyAB ,
I see. I will stop other applications on the server and try again at weekend.
BTW, did you run the CPU-MEM training with 4 GPUs together?
@erikguo
BTW, did you run the CPU-MEM training with 4 GPUs together?
No, because 4x more CPU-RAM would be required for the same mini_batch_size.
It would be 4x faster (if you have 64 - 128 PCIe lanes on the CPU, like an AMD Epyc CPU), but it would require 4x more CPU-RAM.
Isn't there a GPU memory leak? After calling free_network there is still memory in use according to nvidia-smi. Adding a loop fills up the GPU and then crashes.
for (int p = 0; p < 1000; p++) {
    // parse the cfg and (optionally) load weights for a fresh network on each pass
    network subnet = parse_network_cfg(cfgfile);
    if (weightfile) {
        load_weights(&subnet, weightfile);
    }
    *subnet.seen = 0;
    // train for one pass over the dataset
    while (*subnet.seen < train_images_num) {
        pthread_join(load_thread, 0);
        train = buffer;
        load_thread = load_data(args);
        float loss = train_network_waitkey(subnet, train, 0);
        free_data(train);
    }
    // temporarily switch to batch=1 to compute mAP
    int tmp = subnet.batch;
    set_batch_network(&subnet, 1);
    float map = validate_detector_map(datacfg, cfgfile, weightfile, 0.25, 0.5, 0, subnet.letter_box, &subnet);
    printf("%f", map);
    set_batch_network(&subnet, tmp);
    // free the network, but nvidia-smi still shows GPU memory in use afterwards
    free_network(subnet);
}
@kossolax Is it related to optimized_memory=3 and GPU-processing on CPU-RAM? Or just related to free_network()?
I'm using optimized_memory=0, so it's just related to free_network(). Since you changed a lot of the memory-handling code, I guessed this could be related. Should I start a new issue?
@kossolax Yes, start a new issue; I will investigate it.
@AlexeyAB Hello,
I think cross-iteration batch normalization can achieve a similar result with higher training speed.
https://github.com/Howal/Cross-iterationBatchNorm
@WongKinYiu Hi,
I implemented part of CBN - averaging the statistics inside one batch. So you can increase accuracy just by increasing batch= in the cfg-file and setting cbn=1 instead of batch_normalize=1.
So batch=120 subdivisions=4 with CBN should work better than batch=120 subdivisions=4 with BN.
But batch=120 subdivisions=4 with CBN will work worse than batch=120 subdivisions=1 with BN.
I.e. using batch=64 subdivisions=8 with BN, avg mini_batch_size = 8
64/8 = 8
I.e. using batch=64 subdivisions=8 with CBN, avg mini_batch_size = 36
(8+16+24+32+40+48+56+64)/8 = 36
You can try it on Classifier csresnext50
So inside 1 batch it will average the values of the mean and variance.
I.e. if you train with batch=64 subdivisions=16, then there will be 16 mini_batches of size 4.
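As a minimal sketch (my own illustration in plain C, not code from darknet), the average effective mini-batch size under this within-batch averaging can be computed like this:
#include <stdio.h>

// For batch=64, subdivisions=8 this prints 36, matching (8+16+24+32+40+48+56+64)/8 above.
int main(void) {
    int batch = 64, subdivisions = 8;
    int mini_batch = batch / subdivisions;
    int sum = 0;
    for (int i = 1; i <= subdivisions; i++)
        sum += mini_batch * i;  // the i-th mini-batch is normalized over i*mini_batch samples
    printf("avg effective mini_batch = %d\n", sum / subdivisions);
    return 0;
}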
For using:
[convolutional]
cbn=1
filters=16
size=3
stride=1
pad=1
activation=leaky
or
[convolutional]
batch_normalize=1
cbn=1
filters=16
size=3
stride=1
pad=1
activation=leaky
or
[convolutional]
batch_normalize=2
filters=16
size=3
stride=1
pad=1
activation=leaky
Since we change the weights (conv weights, biases, scales) only after processing the whole batch, using averaging inside 1 batch (without cross-iteration) does not cause problems with stale statistics.
Paper: https://arxiv.org/abs/2002.05712v2

I used these formulas (attached as images):
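Roughly, the within-batch averaging can be written as follows (my own paraphrase of the idea; the exact formulas are in the attached images and in the paper):
\[
\hat{\mu}_k = \frac{1}{k}\sum_{i=1}^{k}\mu_i, \qquad
\hat{\sigma}^2_k = \frac{1}{k}\sum_{i=1}^{k}\left(\sigma^2_i + \mu_i^2\right) - \hat{\mu}_k^2, \qquad
\hat{x} = \frac{x - \hat{\mu}_k}{\sqrt{\hat{\sigma}^2_k + \epsilon}}
\]
where \mu_i and \sigma^2_i are the mean and variance of the i-th mini-batch of the current batch, so the k-th mini-batch is effectively normalized over k * mini_batch samples.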


@AlexeyAB
Thank you a lot, i ll give you the feedback after finish training.
@WongKinYiu
I also added dynamic mini batch size when you train with random=1: https://github.com/AlexeyAB/darknet/commit/c814d56ec11ed3b22264d8efb2dd4ed27329f5d1
Just add dynamic_minibatch=1 in the [net] section:
[net]
batch=64
subdivisions=8
dynamic_minibatch=1
width=416
height=416
...
[yolo]
random=1
So even if part of CBN does not work properly, you can still use dynamic_minibatch=1 to increase the mini_batch size.
0.8 is just a coefficient to avoid running out of memory at some network resolutions (sometimes cuDNN requires much more memory for a lower resolution than for a higher one), but you can try setting it to 0.9: https://github.com/AlexeyAB/darknet/blob/c814d56ec11ed3b22264d8efb2dd4ed27329f5d1/src/detector.c#L191
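Roughly (my reading of the idea, not the exact code in detector.c), the mini-batch is rescaled with the network resolution so that memory use stays about constant, with that coefficient as a safety margin:
// Illustrative sketch: scale the mini-batch when random=1 picks a new resolution.
// 'coeff' (e.g. 0.8 or 0.9) is the safety margin against out-of-memory.
int dynamic_mini_batch(int init_mini_batch, int init_w, int init_h,
                       int new_w, int new_h, float coeff) {
    float scale = (float)(init_w * init_h) / (float)(new_w * new_h);
    int mb = (int)(init_mini_batch * scale * coeff);
    return mb < 1 ? 1 : mb;
}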
Also, you can adjust the mini-batch size to your GPU-RAM amount (batch and subdivisions don't necessarily have to be multiples of 2); see the helper sketch after the list below:
batch / subdivisions = mini_batch_size
64/8 = 8
63/7 = 9
70/7 = 10
66/6 = 11
60/5 = 12
65/5 = 13
70/5 = 14
60/4 = 15
64/4 = 16
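A small illustrative helper (my own sketch, not part of darknet) that picks batch and subdivisions for a given mini-batch size, following the ratios above:
#include <stdio.h>

// Given the largest mini_batch that fits in GPU-RAM, pick batch and subdivisions
// near a target batch of ~64 so that batch = mini_batch * subdivisions exactly.
void pick_batch(int mini_batch, int target_batch, int *batch, int *subdivisions) {
    *subdivisions = (target_batch + mini_batch / 2) / mini_batch;  // round to nearest
    if (*subdivisions < 1) *subdivisions = 1;
    *batch = mini_batch * (*subdivisions);
}

int main(void) {
    int batch, subdivisions;
    pick_batch(13, 64, &batch, &subdivisions);
    printf("batch=%d subdivisions=%d\n", batch, subdivisions);  // batch=65 subdivisions=5
    return 0;
}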
@AlexeyAB OK,
Thank you, SpineNet-49-omega will finish training in half hour.
Will report the result soon.
I tried yolov3-spp.cfg with the following settings:
optimized_memory=3
workspace_size_limit_MB=1000
My CPU-RAM is 64 GB; after loading, it uses 20.9 GB,
but it always gets stuck here:
net.optimized_memory = 3
batch = 1, time_steps = 1, train = 0
yolov3-spp
net.optimized_memory = 3
pre_allocate... pinned_ptr = 0000000000000000
pre_allocate: size = 8192 MB, num_of_blocks = 8, block_size = 1024 MB
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
batch = 8, time_steps = 1, train = 1
Pinned block_id = 0, filled = 88.134911 %
Pinned block_id = 1, filled = 96.948578 %
Pinned block_id = 2, filled = 96.949005 %
Pinned block_id = 3, filled = 99.152946 %
Pinned block_id = 4, filled = 99.153809 %
Pinned block_id = 5, filled = 98.830368 %
Pinned block_id = 6, filled = 99.875595 %
Done! Loaded 85 layers from weights-file
Could you tell me why?
Now I get this error:
CUDA Error: invalid device pointer: No error
Assertion failed: 0, file ....\src\utils.c, line 325
Just tried to run with this configuration:
batch=64
subdivisions=4
dynamic_minibatch=1
width=960
height=576
optimized_memory=3
workspace_size_limit_MB=8000
and got this error:
CUDA status Error: file: /home/lucas/Development/darknet/src/dark_cuda.c : () : line: 454 : build time: May 18 2020 - 15:30:02
CUDA Error: invalid device pointer
CUDA Error: invalid device pointer: Resource temporarily unavailable
I've tried several different values for workspace_size_limit_MB and subdivisions, and all fail with the same message. I was running with a single GPU, and I peaked at about 40 GB / 64 GB memory usage on the CPU.
@WongKinYiu @AlexeyAB @cenit @LukeAi
Hi everyone!
A few simple questions I could not find answers to anywhere else, even on Google Scholar for the second one...
1) ~Is it possible to use dynamic_mini batch=1 while using custom resize of the network eg: "random=1.34"?~
|--> Yes
2) ~Is it possible to use dynamic_mini batch=1 and batch_normalize=2 at the same Time Without messing everything up?~
|--> Yes
3) ~How is it possible that the mini_batch parameter has an influence on mAP with consistent batch size?~
|--> Because batch normalization is computed over the mini-batch, not over the whole batch.
As far as my understanding goes, the batch size is the number of samples processed before the weights update,
but mini_batch is just a computational trick to avoid loading and processing the whole batch at once, so it should not have an impact...
I would be very happy with an answer to those questions, and I'm sure I am not alone in not understanding this.
What parameters can I use with an nVidia Quadro M1000M (GPU_RAM = 2 GB) and an i7 + CPU_RAM = 64 GB?


###
# Training
batch=64
subdivisions=8
###
width=608
height=608
###
optimized_memory=3
workspace_size_limit_MB=2000
mini_batch=16
I tried to use these, but training would take 100+ hours, which is too long.
On another PC with a GTX 970 4 GB and an i5 with 16 GB, using these parameters:
###
# Training
batch=64
subdivisions=16
###
width=608
height=608
I get ~16-20 h of training time.
Classes=5, max iterations=10000.
On the laptop with these settings:
###
# Training
batch=64
subdivisions=32
###
width=608
height=608
### NOT USED ###
# optimized_memory=3
# workspace_size_limit_MB=2000
# mini_batch=16
I'm getting this:

BTW, this is Tiny YOLOv4.
@igoriok1994 what are you trying to achieve? What is your end goal or output? It Will help with recommended settings.
@igoriok1994 what are you trying to achieve? What is your end goal or output? It Will help with recommended settings.
I want to speed up training without mAP loss :)
@igoriok1994 CPU memory is very slow, in my experience 5x+ slower than regular GPU training. The benefit of CPU-memory training is to increase precision (mAP) by increasing the batch size beyond the memory available on your GPU.