Darknet: Beta: Using CPU-RAM instead of GPU-VRAM for large Mini_batch=32 - 128

Created on 26 Nov 2019  ·  69 Comments  ·  Source: AlexeyAB/darknet

Higher mini_batch -> higher accuracy (mAP/Top1/Top5).

Training on GPU while using CPU-RAM allows you to increase the mini_batch size significantly, by 4x-16x or more.

You can train with a 16x higher mini_batch at about 5x lower speed on Yolov3-spp; it should give you roughly +2-4 mAP.

Use in your cfg-file:

[net]
batch=64
subdivisions=2
width=416
height=416
optimized_memory=3
workspace_size_limit_MB=1000
  • multi-GPU is not tested
  • random=1 is not supported

Tested:

  • GeForce RTX 2070 - 8 GB VRAM
  • CPU Core i7 6700K - 32 GB RAM

Tested on the model https://github.com/AlexeyAB/darknet/blob/master/cfg/yolov3-spp.cfg with width=416 height=416 on 8GB_GPU_VRAM + 32GB_CPU_RAM

./darknet detector train data/obj.data yolov3-spp.cfg -map

  • default: mini_batch=8 = batch_64 / subdivisions_8, GPU-RAM-usage=6.5 GB, iteration = 3 sec

  • optimized_memory=1: mini_batch=8 = batch_64 / subdivisions_8, GPU-RAM-usage=5.8 GB, iteration = 3 sec

  • optimized_memory=2 workspace_size_limit_MB=1000: mini_batch=20 = batch_60 / subdivisions_3, GPU-RAM-usage=5.4 GB, iteration = 15 sec

  • optimized_memory=3 workspace_size_limit_MB=1000: mini_batch=32 = batch_64 / subdivisions_2, GPU-RAM-usage=4.0 GB, iteration = 15 sec (CPU-RAM-usage = 31 GB)


Not well tested yet:

  • optimized_memory=3 workspace_size_limit_MB=2000: mini_batch=64 = batch_128 / subdivisions_2, GPU-RAM-usage=7.5 GB, iteration = 15 sec (CPU-RAM-usage = 62 GB)

  • optimized_memory=3 workspace_size_limit_MB=2000 or 4000: mini_batch=128 = batch_256 / subdivisions_2, GPU-RAM-usage=13.5 GB, iteration = 15 sec (CPU-RAM-usage = 124 GB)



Example of trained model: yolov3-tiny_pan_gaus_giou_scale.cfg.txt

| mini_batch=32 (+5 mAP@0.5) | mini_batch=8 |
|---|---|
| chart | chart |

Labels: Likely bug, ToDo

Most helpful comment

@AlexeyAB OK,

Thank you, SpineNet-49-omega will finish training in half hour.
Will report the result soon.

All 69 comments

do you think switching to this higher mini batch after having already trained the usual way will give added value as well?

@HagegeR I didn't test it well. So just try.

In general - yes.

You can try to train the first several % of iterations with large mini_batch,
then continue training with small mini_batch for fast training,
and then continue training the last few percent of iterations with high mini_batch.

Please could you explain in more detail the meaning of the options or how to work out a good configuration? I'm trying to get this feature going with my custom gaussian cfg but I'm not having success so far.
What does this mean?
optimized_memory=3
workspace_size_limit_MB=1000

@LukeAI

Param optimized_memory= is related to GPU-memory optimization (a minimal CUDA sketch of the underlying mapped pinned-memory mechanism follows this list):

  • optimized_memory=0 - no additional memory optimization (the default)

  • optimized_memory=1 - delta_gpu is optimized: instead of many arrays, it allocates 2 arrays, global_delta_gpu & state_delta_gpu, which are reused by most layers. It doesn't slow down training, but may work incorrectly on new models added later.

  • optimized_memory=2 - additionally uses CPU-RAM instead of GPU-VRAM for the arrays output_gpu (the layer output), activation_input_gpu (the activation input) and x_gpu (the batch-normalization input) in each layer

  • optimized_memory=3 - additionally uses CPU-RAM instead of GPU-VRAM for the global_delta_gpu & state_delta_gpu arrays

  • workspace_size_limit_MB=1000 - limits the cuDNN workspace to 1000 MB.

    • If GPU memory is not enough (CUDA out of memory), try to reduce this value.
    • If Darknet hangs or crashes with strange errors, try to increase this value.
    • (Try 1000 if you have 32 GB CPU-RAM and 2000 if you have 64 GB CPU-RAM.)
    • If the GPU is lost, try to reboot your PC.
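These modes rely on CUDA mapped pinned (page-locked) host memory. The following is not Darknet's actual code, just a minimal standalone sketch, assuming the plain CUDA runtime API, of how a buffer that physically lives in CPU-RAM can be handed to the GPU as a device pointer:

/* Minimal sketch (not Darknet's code) of the mapped pinned-memory mechanism
 * behind optimized_memory=2/3: the large arrays live in CPU-RAM, and the GPU
 * receives a device-side alias of them, accessed over PCIe. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    size_t bytes = (size_t)1024 * 1024 * 1024;   /* one 1 GB block */
    float *host_ptr = NULL, *dev_ptr = NULL;

    /* Must be set before mapped allocations are made in this process. */
    cudaSetDeviceFlags(cudaDeviceMapHost);

    /* Page-locked CPU-RAM that the GPU is allowed to address directly. */
    cudaError_t err = cudaHostAlloc((void **)&host_ptr, bytes, cudaHostAllocMapped);
    if (err != cudaSuccess) {
        printf("cudaHostAlloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    /* Device-side alias of the same memory: kernels use this pointer
     * in place of a cudaMalloc'ed VRAM buffer. */
    cudaHostGetDevicePointer((void **)&dev_ptr, host_ptr, 0);
    printf("host %p mapped to device %p\n", (void *)host_ptr, (void *)dev_ptr);

    cudaFreeHost(host_ptr);
    return 0;
}

Every access through such a pointer travels over PCIe, which is why the iteration times reported above jump from ~3 sec to ~15 sec.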

For Yolov3-spp 416x416 model on 8GB-GPU and 32GB-CPU-RAM try to use: https://github.com/AlexeyAB/darknet/blob/master/cfg/yolov3-spp.cfg

[net]
batch=64
subdivisions=2
width=416
height=416
optimized_memory=3
workspace_size_limit_MB=1000

I'm trying to get this feature going with my custom gaussian cfg but I'm not having success so far.

What problem did you encounter?

What GPU do you use?
How much CPU-RAM do you have?
Rename your cfg-file to a .txt file and attach it.

Such accuracy:

  • MobileNetv3 - Top1 75.37%
  • MixNet-S - Top1 75.68%
  • EfficientNetB0 - Top1 76.3%

can be achieved only if you train with very large mini_batch size (~1024):

With a small mini_batch size (~32), instead of Top1 76.3% we get: https://github.com/AlexeyAB/darknet/issues/3380#issuecomment-501263052

  • Our EfficientNet B0 (224x224) 0.9 BFLOPS - 0.45 B_FMA (16ms / RTX 2070), 4.9M params - 71.3% Top1
  • Official EfficientNetB0 (224x224) 0.78 BFLOPS - 0.39 FMA, 5.3M params - 70.0% Top1

@AlexeyAB

I tried mixnet_m_gpu.cfg with the following settings:

optimized_memory=2
workspace_size_limit_MB=1000

I always get the following error:

 243 conv    600       1 x 1/ 1      1 x   1 x1200 ->    1 x   1 x 600 0.001 BF
 244 conv   1200       1 x 1/ 1      1 x   1 x 600 ->    1 x   1 x1200 0.001 BF
 245 scale Layer: 241
 246 conv    200/   2  1 x 1/ 1      9 x   3 x1200 ->    9 x   3 x 200 0.006 BF
 247 Shortcut Layer: 231
 248 conv   1536       1 x 1/ 1      9 x   3 x 200 ->    9 x   3 x1536 0.017 BF
 249 avg                             9 x   3 x1536 ->   1536
 250 dropout       p = 0.25                  1536  ->   1536
CUDA status Error: file: ./src/dark_cuda.c : () : line: 423 : build time: Dec  3 2019 - 23:02:36 
CUDA Error: invalid argument
CUDA Error: invalid argument: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.

Could you help me find out the cause?

@erikguo I fixed it: https://github.com/AlexeyAB/darknet/commit/5d0352f961f4dc3db8ccad0570481c69305c0143

Just tried mixnet_m_gpu.cfg with

[net]
# Training
batch=120
subdivisions=2
optimized_memory=3
workspace_size_limit_MB=1000

Thank you very much!

I will try now.

By the way, I found that the 'decay' value (0.00005) in mixnet_m_gpu.cfg is different from the other cfgs (decay=0.0005):

momentum=0.9
decay=0.00005

Is it a special setting for mixnet_m_gpu.cfg, or just a typo?

@AlexeyAB

@AlexeyAB

I still get an error, as follows:

Pinned block_id = 3, filled = 99.917603 % 
 241 route  240 238 236 234                        ->    9 x   3 x1200 
 242 avg                             9 x   3 x1200 ->   1200
 243 conv    600       1 x 1/ 1      1 x   1 x1200 ->    1 x   1 x 600 0.001 BF
 244 conv   1200       1 x 1/ 1      1 x   1 x 600 ->    1 x   1 x1200 0.001 BF
 245 scale Layer: 241
 246 conv    200/   2  1 x 1/ 1      9 x   3 x1200 ->    9 x   3 x 200 0.006 BF
 247 Shortcut Layer: 231
 248 conv   1536       1 x 1/ 1      9 x   3 x 200 ->    9 x   3 x1536 0.017 BF
 249 avg                             9 x   3 x1536 ->   1536
 250 dropout       p = 0.25                  1536  ->   1536
 251 conv     51       1 x 1/ 1      1 x   1 x1536 ->    1 x   1 x  51 0.000 BF
 252 softmax                                          51

 Pinned block_id = 4, filled = 98.600769 % 
Total BFLOPS 0.592 
 Allocate additional workspace_size = 18.58 MB 
Loading weights from backup_all/mixnet_m_gpu_last.weights...
 seen 64 
Done! Loaded 253 layers from weights-file 
Learning Rate: 0.064, Momentum: 0.9, Decay: 0.0005
304734
Loaded: 0.933879 seconds
CUDA status Error: file: ./src/blas_kernels.cu : () : line: 668 : build time: Dec  3 2019 - 23:02:38 
CUDA Error: an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.

@erikguo Do you get this error if you disable memory optimization?
Comment these lines:

#optimized_memory=3
#workspace_size_limit_MB=1000

By the way, I found that the 'decay' value (0.00005) is different from the other cfgs (decay=0.0005)

Since MixNet is a continuation of EfficientNet, which is a continuation of MobileNet (...), EfficientNet uses decay=0.00001: https://arxiv.org/pdf/1905.11946v2.pdf

weight decay 1e-5;

After commenting out these lines, training runs very well. With these lines enabled, it runs well only occasionally, but crashes in most cases.

@AlexeyAB

@erikguo

  • How many iterations before crashing?
  • What is the error message?
  • How much CPU RAM do you have?
  • What GPU do you use?
  • Do you use GPU=1 CUDNN=1 OPENCV=1 CUDNN_HALF=1 ?

@AlexeyAB

It crashes at the first iteration.

The crash message is the following:

Pinned block_id = 3, filled = 99.917603 % 
 241 route  240 238 236 234                        ->    9 x   3 x1200 
 242 avg                             9 x   3 x1200 ->   1200
 243 conv    600       1 x 1/ 1      1 x   1 x1200 ->    1 x   1 x 600 0.001 BF
 244 conv   1200       1 x 1/ 1      1 x   1 x 600 ->    1 x   1 x1200 0.001 BF
 245 scale Layer: 241
 246 conv    200/   2  1 x 1/ 1      9 x   3 x1200 ->    9 x   3 x 200 0.006 BF
 247 Shortcut Layer: 231
 248 conv   1536       1 x 1/ 1      9 x   3 x 200 ->    9 x   3 x1536 0.017 BF
 249 avg                             9 x   3 x1536 ->   1536
 250 dropout       p = 0.25                  1536  ->   1536
 251 conv     51       1 x 1/ 1      1 x   1 x1536 ->    1 x   1 x  51 0.000 BF
 252 softmax                                          51

 Pinned block_id = 4, filled = 98.600769 % 
Total BFLOPS 0.592 
 Allocate additional workspace_size = 18.58 MB 
Loading weights from backup_all/mixnet_m_gpu_last.weights...
 seen 64 
Done! Loaded 253 layers from weights-file 
Learning Rate: 0.064, Momentum: 0.9, Decay: 5e-05
304734
Loaded: 1.104122 seconds
CUDA status Error: file: ./src/blas_kernels.cu : () : line: 668 : build time: Dec  3 2019 - 23:02:38 
CUDA Error: an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.
Aborted (core dumped)

My server has 128 GB of memory and 4 x 1080 Ti 11 GB GPUs.

Darknet is compiled with GPU=1 CUDNN=1 OPENCV=1 CUDNN_HALF=0

@erikguo

  • Do you use 4 x GPU for training?
  • What command do you use for training?
  • What batch and subdivisions did you set?

I just trained 2600 iterations successfully on RTX 2070 and CPU Core i7 32 GB CPU-RAM by using this command:
darknet.exe classifier train cfg/imagenet1k_c.data cfg/mixnet_m_gpu.cfg backup/mixnet_m_gpu_last.weights -topk

and this cfg-file: mixnet_m_gpu.cfg.txt

I use only one GPU for training.

The command is the following:

darknet classifier train dengdi.data mixnet_m_gpu.cfg backup/mixnet_m_gpu_last.cfg -dont_show 

batch and subdivisions are as follows:

batch=128
subdivisions=2

mixnet_m_gpu_mem.cfg.txt

@AlexeyAB

@erikguo

  • Why do you use height=96 width=288 ?
  • I successfully ran training with your cfg-file mixnet_m_gpu_mem.cfg.txt on RTX 2070 8 GB-VRAM + 32 GB CPU_RAM
    darknet.exe classifier train cfg/imagenet1k_c.data cfg/mixnet_m_gpu_mem.cfg backup/mixnet_m_gpu_last.weights -topk

image


image

@AlexeyAB

I have tried the following combinations:

batch=128
subdivisions=2
runs very well now

batch=256
subdivisions=2
runs very well now

batch=256
subdivisions=1
crashes in the first iteration

batch=512
subdivisions=2
crashes in the first iteration

Because my images' aspect ratio is about 1:3 (h:w), I set a rectangular network size.

@erikguo
Check this combination:
batch=128
subdivisions=1


batch=256
subdivisions=1
crashes in the first iteration

  • Show screenshot of CPU_RAM usage
  • Show screenshot of GPU_RAM usage
  • Show screenshot of the error message

My OS is Ubuntu 16.04

this combination crashed two times and ran well once; the execution is not stable:
batch=128
subdivisions=1

image
image
image

this combination is bad, it always crashes:
batch=256
subdivisions=1
image
image
image

@erikguo Try to use workspace_size_limit_MB=8000

batch=256
subdivisions=1
optimized_memory=3
workspace_size_limit_MB=8000

The error messages are different:
One is: CUDA status Error: file: ./src/blas_kernels.cu : () : line: 576
The other is: CUDA status Error: file: ./src/dropout_layer_kernels.cu : () : line: 33

The following settings crashed too, with the same error as above.
batch=256
subdivisions=1
optimized_memory=3
workspace_size_limit_MB=8000

Error message:

245 scale Layer: 241
 246 conv    200/   2  1 x 1/ 1      9 x   3 x1200 ->    9 x   3 x 200 0.006 BF
 247 Shortcut Layer: 231
 248 conv   1536       1 x 1/ 1      9 x   3 x 200 ->    9 x   3 x1536 0.017 BF
 249 avg                             9 x   3 x1536 ->   1536
 250 dropout       p = 0.25                  1536  ->   1536
 251 conv     51       1 x 1/ 1      1 x   1 x1536 ->    1 x   1 x  51 0.000 BF
 252 softmax                                          51
Try to allocate new pinned memory, size = 972 MB 

 Pinned block_id = 14, filled = 96.900558 % 
Try to allocate new pinned BLOCK, size = 81 MB 

 Pinned block_id = 15, filled = 95.586395 % 
Try to allocate new pinned BLOCK, size = 50 MB 

 Pinned block_id = 16, filled = 99.300003 % 
Try to allocate new pinned BLOCK, size = 12 MB 

 Pinned block_id = 17, filled = 99.920654 % 
Try to allocate new pinned BLOCK, size = 7 MB 
Total BFLOPS 0.592 
 Allocate additional workspace_size = 160.59 MB 
Loading weights from backup_all/mixnet_m_gpu_last.weights...
 seen 64 
Done! Loaded 253 layers from weights-file 
Learning Rate: 0.064, Momentum: 0.9, Decay: 5e-05
304734
Loaded: 1.654202 seconds
CUDA status Error: file: ./src/dropout_layer_kernels.cu : () : line: 33 : build time: Dec  3 2019 - 23:02:38 
CUDA Error: an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.

@erikguo

Just to localize the problem, try to comment out these 2 lines temporarily and recompile:

  1. https://github.com/AlexeyAB/darknet/blob/efc5478a23a3a3c66d6feefc6d6b485f13503bde/src/dropout_layer_kernels.cu#L32
  2. https://github.com/AlexeyAB/darknet/blob/efc5478a23a3a3c66d6feefc6d6b485f13503bde/src/dropout_layer_kernels.cu#L44

Then try

batch=256
subdivisions=1
optimized_memory=3
workspace_size_limit_MB=8000

After recompiling, the error changed to the following:

Learning Rate: 0.064, Momentum: 0.9, Decay: 5e-05
304734
Loaded: 1.968368 seconds
CUDA status Error: file: ./src/blas_kernels.cu : () : line: 668 : build time: Dec  4 2019 - 23:47:36 
CUDA Error: an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.

@AlexeyAB

@AlexeyAB

After recompiling, I ran it two times with the same command and the same cfg.

The first error:

CUDA status Error: file: ./src/blas_kernels.cu : () : line: 576 : build time: Dec  5 2019 - 00:02:32 
CUDA Error: an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered: File exists

The second error:

CUDA status Error: file: ./src/blas_kernels.cu : () : line: 668 : build time: Dec  5 2019 - 00:02:32 
CUDA Error: an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered: File exists

@AlexeyAB ,

After commenting and recompiling, the error changed to:

CUDA status Error: file: ./src/blas_kernels.cu : () : line: 564 : build time: Dec  5 2019 - 00:14:54 
CUDA Error: an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered: File exists

Now I have commented out lines in these three files:
darknet/src/blas_kernels.cu
darknet/src/network_kernels.cu
darknet/src/dropout_layer_kernels.cu

@erikguo Thanks. Can you compile with DEBUG=1 in the Makefile and run training again? https://github.com/AlexeyAB/darknet/blob/efc5478a23a3a3c66d6feefc6d6b485f13503bde/Makefile#L14

@AlexeyAB ,

Errors:

 cuDNN status = cudaDeviceSynchronize() Error in: file: ./src/convolutional_kernels.cu : () : line: 823 : build time: Dec  5 2019 - 00:41:31 
cuDNN Error: CUDNN_UNKNOWN_STATUS
cuDNN Error: CUDNN_UNKNOWN_STATUS: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.

@erikguo Thanks.

Also, do you get this issue if you remove the [dropout] layer from the end of your cfg-file?

@AlexeyAB

I have commented out the [dropout] layer at the end of the cfg.

It is not stable now. It crashed on the first and third runs and ran well on the second. I got the same error:

CUDA status = cudaDeviceSynchronize() Error: file: ./src/blas_kernels.cu : () : line: 564 : build time: Dec  5 2019 - 00:41:31 
CUDA Error: an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.

It seems that it crashes in the middle of the first iteration, because I had to wait about 15 s before it crashed.

@erikguo

It ran well the second time.

When it starts up well, will it crash later? Or will it work well until the end?

@AlexeyAB ,

When I said it was running well, I mean it can run more than 10 iterations without crashing. I just pressed Ctrl-C to interrupt it and run another time.

@erikguo

When I said it was running well, I mean it can run more than 10 iterations without crashing.

So undo all these changes

Compile with DEBUG=0

Set

batch=256
subdivisions=1
optimized_memory=3
workspace_size_limit_MB=8000

And try to run it several times. When it starts up well, let it keep working: will it crash later?

OK, I will try it and report back later.

@AlexeyAB

I undid all the comments from last night and recompiled.

If I leave the [dropout] layer uncommented, it always crashes immediately after loading the cfg and weights file.
So I commented out the [dropout] layer in the cfg.

I ran it several times. It crashed randomly in the first iteration. However, once it finishes the first iteration, it runs well and never crashes. But the loss goes to nan after several iterations, even if I lower the learning rate.

The following are the running logs:

Loading weights from backup_all/mixnet_m_gpu_last.weights...
 seen 64 
Done! Loaded 252 layers from weights-file 
Learning Rate: 0.016, Momentum: 0.9, Decay: 5e-05
382473
Loaded: 1.856072 seconds
71457, 47.828: 0.002255, 0.002255 avg, 0.011005 rate, 41.015572 seconds, 18292992 images
Loaded: 0.000049 seconds
71458, 47.829: 0.006014, 0.002631 avg, 0.011005 rate, 42.169849 seconds, 18293248 images
Loaded: 0.000042 seconds
71459, 47.830: 4.707831, 0.473151 avg, 0.011005 rate, 40.191479 seconds, 18293504 images
Loaded: 0.000060 seconds
71460, 47.830: nan, nan avg, 0.011005 rate, 39.108238 seconds, 18293760 images
Loaded: 0.000100 seconds
71461, 47.831: nan, nan avg, 0.011005 rate, 39.716503 seconds, 18294016 images
Loaded: 0.000058 seconds
71462, 47.832: nan, nan avg, 0.011004 rate, 39.298252 seconds, 18294272 images
Loaded: 0.000081 seconds
71463, 47.832: nan, nan avg, 0.011004 rate, 39.715801 seconds, 18294528 images
Loaded: 0.000074 seconds
71464, 47.833: nan, nan avg, 0.011004 rate, 39.716663 seconds, 18294784 images
Loaded: 0.000061 seconds
71465, 47.834: nan, nan avg, 0.011004 rate, 39.147743 seconds, 18295040 images
Loaded: 0.000084 seconds
71466, 47.834: nan, nan avg, 0.011004 rate, 39.735199 seconds, 18295296 images
Loaded: 0.000074 seconds
71467, 47.835: nan, nan avg, 0.011004 rate, 40.027672 seconds, 18295552 images
Loaded: 0.000072 seconds
71468, 47.836: nan, nan avg, 0.011004 rate, 39.932713 seconds, 18295808 images
Loaded: 0.000073 seconds
71469, 47.836: nan, nan avg, 0.011004 rate, 39.481960 seconds, 18296064 images
Loaded: 0.000114 seconds
71470, 47.837: nan, nan avg, 0.011004 rate, 40.012989 seconds, 18296320 images
Loaded: 0.000082 seconds
71471, 47.838: nan, nan avg, 0.011004 rate, 39.614643 seconds, 18296576 images
Loaded: 0.000069 seconds
71472, 47.838: nan, nan avg, 0.011004 rate, 39.501343 seconds, 18296832 images
Loaded: 0.000077 seconds
71473, 47.839: nan, nan avg, 0.011004 rate, 39.760441 seconds, 18297088 images
Loaded: 0.000063 seconds
71474, 47.840: nan, nan avg, 0.011004 rate, 39.416786 seconds, 18297344 images
Loaded: 0.000070 seconds
71475, 47.840: nan, nan avg, 0.011004 rate, 39.673023 seconds, 18297600 images
Loaded: 0.000075 seconds
71476, 47.841: nan, nan avg, 0.011004 rate, 39.329891 seconds, 18297856 images
Loaded: 0.000077 seconds
71477, 47.842: nan, nan avg, 0.011004 rate, 40.461945 seconds, 18298112 images
Loaded: 0.000072 seconds
71478, 47.842: nan, nan avg, 0.011003 rate, 39.966011 seconds, 18298368 images
Loaded: 0.000063 seconds
71479, 47.843: nan, nan avg, 0.011003 rate, 39.231728 seconds, 18298624 images
Loaded: 0.000070 seconds
71480, 47.844: nan, nan avg, 0.011003 rate, 39.738995 seconds, 18298880 images
Loaded: 0.000096 seconds
71481, 47.844: nan, nan avg, 0.011003 rate, 40.647068 seconds, 18299136 images
Loaded: 0.000089 seconds
71482, 47.845: nan, nan avg, 0.011003 rate, 41.785786 seconds, 18299392 images
Loaded: 0.000087 seconds
71483, 47.846: nan, nan avg, 0.011003 rate, 40.824448 seconds, 18299648 images
Loaded: 0.000105 seconds
71484, 47.846: nan, nan avg, 0.011003 rate, 40.963627 seconds, 18299904 images
Loaded: 0.000076 seconds
71485, 47.847: nan, nan avg, 0.011003 rate, 40.498711 seconds, 18300160 images
Loaded: 0.000076 seconds
71486, 47.848: nan, nan avg, 0.011003 rate, 39.802647 seconds, 18300416 images
Loaded: 0.000075 seconds
71487, 47.848: nan, nan avg, 0.011003 rate, 40.423454 seconds, 18300672 images
Loaded: 0.000061 seconds
71488, 47.849: nan, nan avg, 0.011003 rate, 39.450256 seconds, 18300928 images
Loaded: 0.000083 seconds
71489, 47.850: nan, nan avg, 0.011003 rate, 40.406216 seconds, 18301184 images
Loaded: 0.000068 seconds
71490, 47.850: nan, nan avg, 0.011003 rate, 39.633228 seconds, 18301440 images
Loaded: 0.000076 seconds
71491, 47.851: nan, nan avg, 0.011003 rate, 39.777164 seconds, 18301696 images
Loaded: 0.000073 seconds

These are the errors from when it crashed:

CUDA status Error: file: ./src/dropout_layer_kernels.cu : () : line: 33 : build time: Dec  5 2019 - 21:46:20 
CUDA Error: an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.

And I also tested enet-b0-nog.cfg (I removed all groups in the [convolutional] layers) with the following:

batch=256
subdivisions=1
optimized_memory=3
workspace_size_limit_MB=8000

Even though I commented out [dropout], it always crashed with this error:

CUDA status Error: file: ./src/dropout_layer_kernels.cu : () : line: 33 : build time: Dec  5 2019 - 21:46:20 
CUDA Error: an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.

Even though I commented out [dropout], it always crashed with this error:

CUDA status Error: file: ./src/dropout_layer_kernels.cu : () : line: 33 : build time: Dec  5 2019 - 21:46:20 
CUDA Error: an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.

This is very strange: how can it crash in the DropOut layer if you commented out all DropOut layers?
Just so you know, there are many DropOut layers in EfficientNet.

@AlexeyAB ,

you are right, I forgot to comment out all the [dropout] layers. After commenting them all out, these are the error messages:

# running the nvidia-smi command, I found new messages when it crashed:
GPU 00000000:03:00.0: Detected Critical Xid Error
GPU 00000000:03:00.0: Detected Critical Xid Error

# crashed errors:
CUDA status Error: file: ./src/dark_cuda.c : () : line: 446 : build time: Dec  5 2019 - 21:46:18 
CUDA Error: an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.

For enet-b0-nog.cfg, I have to set batch=96, subdivisions=1; then it runs stably with CPU-RAM optimized_memory=3.

I found one pattern: once it has crashed, it is very hard to get it to rerun well. You should wait some time and then try again; maybe it will run well that time. It looks like some memory is not released, or something remains in memory, and after waiting for a while the OS cleans it up automatically.

@AlexeyAB

@erikguo
I noticed that if it crashes, especially with out-of-memory (CPU/GPU memory), then the GPU hardware device can be lost, so you should wait or reboot the PC.

For enet-b0-nog.cfg, I have to set batch=96, subdivisions=1; then it runs stably with CPU-RAM optimized_memory=3.

What CPU-RAM and GPU-VRAM usage do you get?

There's a lot of free CPU memory and GPU memory. After the crash, I can immediately run other training jobs very well, but I cannot run the CPU-MEM training.

On Windows, when it crashes, the GPU card gets lost, but on Ubuntu it doesn't. I have used both Windows and Ubuntu before.

@AlexeyAB

Based on this behavior, I guess some memory allocation is 'random': when the allocation happens to be right, there is no crash; otherwise, it crashes.

@erikguo

Pinned CPU-RAM has to be allocated as sequential physical blocks of 1 GB, so if you have 128 GB of CPU-RAM and you run 128 applications, each of which consumes 1 byte somewhere in each of the 128 GB, then no pinned memory can be allocated at all.
For example, if you run 64 applications, each of which consumes 1 byte somewhere in each of 64 GB, then only 64 GB of pinned memory can be allocated.

Maybe this is the reason for this behavior:

Based on this behavior, I guess some memory allocation is 'random': when the allocation happens to be right, there is no crash; otherwise, it crashes.

So it is strongly recommended to reboot the system before running Darknet with GPU-processing + CPU-RAM, and not to load any other applications.
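To check how much pinned memory the OS can actually provide at a given moment, you can probe it directly. This is only a diagnostic sketch (not part of Darknet), assuming the CUDA runtime API; it keeps requesting 1 GB page-locked blocks until allocation fails:

/* Diagnostic sketch (not part of Darknet): count how many 1 GB pinned
 * blocks cudaHostAlloc can currently provide before it fails. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t block = (size_t)1024 * 1024 * 1024;
    void *blocks[256];
    int n = 0;

    while (n < 256 &&
           cudaHostAlloc(&blocks[n], block, cudaHostAllocDefault) == cudaSuccess)
        n++;

    printf("Allocated %d pinned 1 GB blocks before failure\n", n);

    while (n > 0)
        cudaFreeHost(blocks[--n]);
    return 0;
}

If this reports far less than the free RAM shown by the OS, rebooting or closing other applications before training usually helps.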

why does it need to be sequential?

@HagegeR

Oh yes, the pinned CPU-memory blocks (GPU-Direct 1.0) do not have to be completely sequential.
I confused this with GPU-Direct 3.0 (RDMA), where the GPU uses the CPU-memory of a remote computer through InfiniBand - in that case, the mapped memory should be a physically sequential block:
GPU -> PCIe -> Computer_1(PCIeController) -> Infiniband -> Computer_2(PCIeController) -> CPU_RAM

(left scheme: RDMA_P2P_bars diagram)

@AlexeyAB ,

I see. I will stop the other applications on the server and try again at the weekend.

BTW, did you run the CPU-MEM training with 4 GPUs together?

@erikguo

BTW, did you run the CPU-MEM training with 4 GPUs together?

No, because it would require 4x more CPU-RAM for the same mini_batch_size.
It would be 4x faster (if you have 64-128 PCIe lanes on the CPU, like an AMD Epyc CPU), but it would require 4x more CPU-RAM.

isn't there a GPU memory leak? After calling free_network there is still memory in use according to nvidia-smi. Adding a loop will fill up the GPU and then crash.

for (int p = 0; p < 1000; p++) {

    // reload the network and weights on every pass of the loop
    network subnet = parse_network_cfg(cfgfile);
    if (weightfile) {
        load_weights(&subnet, weightfile);
    }

    *subnet.seen = 0;

    // train for one epoch
    while (*subnet.seen < train_images_num) {
        pthread_join(load_thread, 0);
        train = buffer;
        load_thread = load_data(args);

        float loss = train_network_waitkey(subnet, train, 0);
        free_data(train);
    }

    int tmp = subnet.batch;

    set_batch_network(&subnet, 1);
    float map = validate_detector_map(datacfg, cfgfile, weightfile, 0.25, 0.5, 0, subnet.letter_box, &subnet);
    printf("%f", map);

    set_batch_network(&subnet, tmp);

    // GPU memory reported by nvidia-smi is not fully released here
    free_network(subnet);
}

@kossolax Is it related to optimized_memory=3 and GPU-processing on CPU-RAM? Or just related to free_network()?

I'm using optimized_memory=0, so it's just related to free_network. Since you changed the memory handling a lot, I guess this could be related; should I start a new issue?

@kossolax Yes, start new issue, I will investigate it.

@AlexeyAB Hello,

I think Cross-Iteration Batch Normalization can achieve a similar result with higher training speed.
https://github.com/Howal/Cross-iterationBatchNorm

@WongKinYiu Hi,

I implemented part of CBN - averaging statistics inside one batch. So you can increase accuracy just by increasing batch= in the cfg-file and setting cbn=1 instead of batch_normalize=1.
So batch=120 subdivisions=4 with CBN should work better than batch=120 subdivisions=4 with BN.
But batch=120 subdivisions=4 with CBN will work worse than batch=120 subdivisions=1 with BN.

I.e. using batch=64 subdivisions=8 with BN, avg mini_batch_size = 8
64/8 = 8

I.e. using batch=64 subdivisions=8 with CBN, avg mini_batch_size = 36
(8+16+24+32+40+48+56+64)/8 = 36

You can try it on Classifier csresnext50


So inside 1 batch it will average the values of Mean and Variance.
I.e. if you train with batch=64 subdivisions=16, then there will be 16 mini_batches of size 4.

  • For the 1st mini_batch will use Mean[1] & Variance[1]
  • For the 2nd mini_batch will use avg(Mean[1], Mean[2]) & avg(Variance[1], Variance[2])
  • For the 3rd mini_batch will use avg(Mean[1], Mean[2], Mean[3]) & avg(Variance[1], Variance[2], Variance[3])
    ....

For using:

[convolutional]
cbn=1
filters=16
size=3
stride=1
pad=1
activation=leaky

or

[convolutional]
batch_normalize=1
cbn=1
filters=16
size=3
stride=1
pad=1
activation=leaky

or

[convolutional]
batch_normalize=2
filters=16
size=3
stride=1
pad=1
activation=leaky

Since we change the weights (conv-weights, biases, scales) only after processing the whole batch, using averaging inside 1 batch (without cross-iteration) does not cause problems with statistics obsolescence.
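As a rough illustration of this averaging (a simplified sketch, not Darknet's batch_normalize=2 implementation, and without the cross-iteration Taylor compensation from the CBN paper), the k-th mini_batch normalizes with the running average of the statistics of mini_batches 1..k of the current batch:

/* Simplified sketch of averaging BN statistics inside one batch:
 * mini_batch k uses the average of Mean[1..k] and Variance[1..k].
 * Not the actual Darknet batch_normalize=2 code; the per-mini_batch
 * statistics below are made-up numbers for one channel. */
#include <stdio.h>

int main(void)
{
    float mean[4]     = {0.10f, 0.14f, 0.08f, 0.12f};
    float variance[4] = {1.00f, 0.90f, 1.10f, 0.95f};
    float mean_avg = 0.f, var_avg = 0.f;

    for (int k = 0; k < 4; ++k) {
        /* Running average over the mini_batches seen so far in this batch. */
        mean_avg = (mean_avg * k + mean[k]) / (k + 1);
        var_avg  = (var_avg  * k + variance[k]) / (k + 1);
        printf("mini_batch %d normalizes with mean=%.4f var=%.4f\n",
               k + 1, mean_avg, var_avg);
    }
    return 0;
}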

Paper: https://arxiv.org/abs/2002.05712v2

image


I used these formulas:

image


image

@AlexeyAB

Thank you a lot, i ll give you the feedback after finish training.

@WongKinYiu

I also added dynamic mini batch size when you train with random=1: https://github.com/AlexeyAB/darknet/commit/c814d56ec11ed3b22264d8efb2dd4ed27329f5d1

Just add dynamic_minibatch=1 in the [net] section:

[net]
batch=64
subdivisions=8
dynamic_minibatch=1
width=416
height=416

...
[yolo]
random=1

So

  • network resolution will be 288x288 - 608x608 due to random=1
  • for 608x608 the mini batch size = batch/subdivisions = 8
  • for 416x416 the mini batch size = 0.8 x ((608x608)/(416x416)) x batch/subdivisions = 13
  • for 288x288 the mini batch size = 0.8 x ((608x608)/(288x288)) x batch/subdivisions = 28

So even if part of CBN will not work properly, you can still use dynamic_minibatch=1 to increase mini_batch size.

0.8 is just a coefficient to avoid out of memory for some network resolutions (sometimes cuDNN requires much more memory for a lower resolution than for a higher one), but you can try to set 0.9: https://github.com/AlexeyAB/darknet/blob/c814d56ec11ed3b22264d8efb2dd4ed27329f5d1/src/detector.c#L191
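For illustration, here is a small sketch of this scaling rule (the actual logic lives in src/detector.c at the link above; this just restates the formula in C):

/* Sketch of the dynamic_minibatch=1 scaling described above: the smaller
 * the randomly drawn resolution, the more images fit into GPU memory,
 * scaled down by the 0.8 safety coefficient. */
#include <stdio.h>

int main(void)
{
    const int batch = 64, subdivisions = 8;      /* from the [net] section  */
    const int max_w = 608, max_h = 608;          /* largest random=1 size   */
    const float coef = 0.8f;                     /* safety margin for cuDNN */
    const int sizes[3][2] = { {608, 608}, {416, 416}, {288, 288} };

    for (int i = 0; i < 3; ++i) {
        int w = sizes[i][0], h = sizes[i][1];
        float scale = (float)(max_w * max_h) / (float)(w * h);
        int mini_batch = (int)(coef * scale * batch / subdivisions);
        if (mini_batch < batch / subdivisions)
            mini_batch = batch / subdivisions;   /* never below the base value */
        printf("%dx%d -> mini_batch = %d\n", w, h, mini_batch);
    }
    return 0;
}

This prints 8, 13 and 28 for 608x608, 416x416 and 288x288, matching the numbers above.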


Also, you can adjust the mini batch size to your amount of GPU-RAM (batch and subdivisions do not necessarily have to be multiples of 2):
batch / subdivisions = mini_batch_size
64/8 = 8
63/7 = 9
70/7 = 10
66/6 = 11
60/5 = 12
65/5 = 13
70/5 = 14
60/4 = 15
64/4 = 16

@AlexeyAB OK,

Thank you, SpineNet-49-omega will finish training in half hour.
Will report the result soon.

I tried yolov3-spp.cfg with the following settings:
optimized_memory=3
workspace_size_limit_MB=1000
My CPU-RAM is 64 GB; after loading it uses 20.9 GB,
but it always gets stuck here:

net.optimized_memory = 3
batch = 1, time_steps = 1, train = 0
yolov3-spp
net.optimized_memory = 3
pre_allocate... pinned_ptr = 0000000000000000
pre_allocate: size = 8192 MB, num_of_blocks = 8, block_size = 1024 MB
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
batch = 8, time_steps = 1, train = 1
Pinned block_id = 0, filled = 88.134911 %
Pinned block_id = 1, filled = 96.948578 %
Pinned block_id = 2, filled = 96.949005 %
Pinned block_id = 3, filled = 99.152946 %
Pinned block_id = 4, filled = 99.153809 %
Pinned block_id = 5, filled = 98.830368 %
Pinned block_id = 6, filled = 99.875595 %
Done! Loaded 85 layers from weights-file

could you tell me why?


Now I got this error:

CUDA Error: invalid device pointer: No error
Assertion failed: 0, file ....\src\utils.c, line 325

Just tried to run with this configuration:

batch=64
subdivisions=4
dynamic_minibatch=1
width=960
height=576
optimized_memory=3
workspace_size_limit_MB=8000

and got this error:

CUDA status Error: file: /home/lucas/Development/darknet/src/dark_cuda.c : () : line: 454 : build time: May 18 2020 - 15:30:02 

 CUDA Error: invalid device pointer
CUDA Error: invalid device pointer: Resource temporarily unavailable

I've tried several different values for workspace_size_limit_MB and subdivisions, and all fail with the same message. I was running with a single GPU, and I peaked at about 40 GB / 64 GB memory usage on the CPU.

@WongKinYiu @AlexeyAB @cenit @LukeAi

Hi everyone!
A few simple questions I could not find answers to anywhere else... even on Google Scholar for the second one...

1) ~Is it possible to use dynamic_minibatch=1 while using a custom resize of the network, e.g. "random=1.34"?~
|--> Yes
2) ~Is it possible to use dynamic_minibatch=1 and batch_normalize=2 at the same time without messing everything up?~
|--> Yes
3) ~How is it possible that the mini_batch parameter has an influence on mAP with a constant batch size?~
|--> Because batch normalization is done on the mini_batch size and not on the batch size (see the sketch below).

As far as my understanding goes, the batch size is the number of samples processed before the weights update,
but mini_batch is just a computational trick to avoid loading and processing the whole batch at once and should not have an impact...

I would be very happy with an answer to these questions, and I'm sure I am not alone in not understanding this.
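To make the answer to (3) concrete, here is a small sketch (not Darknet code, with made-up numbers) of why the mini_batch size matters: the batch-normalization statistics are computed only over the mini_batch that is resident on the GPU, so a small mini_batch gives noisier estimates even though the weights are still updated once per full batch:

/* Sketch (not Darknet code) of batch-norm statistics per mini_batch:
 * the mean/variance used for normalization cover only the images of the
 * current mini_batch, not the whole batch. */
#include <stdio.h>

static void bn_stats(const float *x, int mini_batch, float *mean, float *var)
{
    float m = 0.f, v = 0.f;
    for (int i = 0; i < mini_batch; ++i) m += x[i];
    m /= mini_batch;
    for (int i = 0; i < mini_batch; ++i) v += (x[i] - m) * (x[i] - m);
    *mean = m;
    *var = v / mini_batch;
}

int main(void)
{
    /* One activation value per image; a batch of 8 images. */
    float batch_vals[8] = {0.9f, 1.1f, 0.2f, 1.8f, 1.0f, 0.4f, 1.6f, 1.0f};
    float mean, var;

    bn_stats(batch_vals, 8, &mean, &var);        /* mini_batch = whole batch */
    printf("mini_batch=8: mean=%.3f var=%.3f\n", mean, var);

    for (int k = 0; k < 4; ++k) {                /* mini_batch = 2 */
        bn_stats(batch_vals + 2 * k, 2, &mean, &var);
        printf("mini_batch=2, part %d: mean=%.3f var=%.3f\n", k + 1, mean, var);
    }
    return 0;
}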

What parameters can I use with an Nvidia Quadro M1000M (GPU_RAM = 2 GB) and an i7 with CPU_RAM = 64 GB?

image
image

###
# Training
batch=64
subdivisions=8

###
width=608
height=608

###
optimized_memory=3
workspace_size_limit_MB=2000
mini_batch=16

Tried to use these, but 100+ hours for training is too long.


On another PC, with a GTX 970 4 GB and an i5 with 16 GB RAM, using these parameters:

###
# Training
batch=64
subdivisions=16

###
width=608
height=608

I've got ~16-20h of training

Classes=5, max iterations= 10000.

On laptop with settings:

###
# Training
batch=64
subdivisions=32

###
width=608
height=608

### NOT USED ###
# optimized_memory=3
# workspace_size_limit_MB=2000
# mini_batch=16

getting this:

image

Btw, this is Tiny YOLOv4.

@igoriok1994 what are you trying to achieve? What is your end goal or output? It Will help with recommended settings.

@igoriok1994 what are you trying to achieve? What is your end goal or output? It Will help with recommended settings.

I want to speed up training without mAP loss :)

@igoriok1994 CPU memory is very slow, in my experience 5x + slower than regular GPU training. The benefit of CPU memory training is to increase precision (mAP) by increasing the batch size beyond the memory available on your GPU.
