Hi, guys,
DALI is amazing in the MXNet RN50 example. Thanks for the good work.
I tried to test it with the NVIDIA TensorFlow example but did not observe the speedup I expected. I'm not sure if anything is wrong. Here is the setup I used.
1) https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/Classification/ConvNets/resnet50v1.5
2) 20.03-tf1-py3 NGC Docker image
3) 1 V100 32GB PCIE card
This is the command I used
python main.py --arch=resnet50 --mode=train_and_evaluate --iter_unit=epoch --num_iter=50 --batch_size=256 --warmup_steps=100 --use_cosine --label_smoothing 0.1 --lr_init=0.256
--lr_warmup_epochs=8 --momentum=0.875 --weight_decay=3.0517578125e-05 --use_tf_amp --use_static_loss_scaling --loss_scale 128 --data_dir=/data/source_data/build_imagenet_data-rebuild/final_output/tfrecords/ --data_idx_dir=/data/source_data/build_imagenet_data-rebuild/final_output/dali_idx/ --results_dir=./result1/ --use_xla --use_dali
This is the log I got
DLL 2020-09-01 06:41:11.984949 - (3, 16135) imgs_per_sec : 1146.8574254443538 cross_entropy : 3.4489827156066895 l2_loss : 0.6694231629371643 total_loss : 4.118405818939209 learning_rate : 0.10318145900964737
DLL 2020-09-01 06:41:12.208649 - (3, 16136) imgs_per_sec : 1146.482712670826 cross_entropy : 3.6176087856292725 l2_loss : 0.6694180369377136 total_loss : 4.287026882171631 learning_rate : 0.10318785160779953
DLL 2020-09-01 06:41:12.431754 - (3, 16137) imgs_per_sec : 1149.5426674945934 cross_entropy : 3.561917781829834 l2_loss : 0.6694126725196838 total_loss : 4.231330394744873 learning_rate : 0.10319424420595169
DLL 2020-09-01 06:41:12.655259 - (3, 16138) imgs_per_sec : 1147.477586212424 cross_entropy : 3.91888165473938 l2_loss : 0.6694076061248779 total_loss : 4.588289260864258 learning_rate : 0.10320064425468445
DLL 2020-09-01 06:41:12.880025 - (3, 16139) imgs_per_sec : 1141.1483137445784 cross_entropy : 3.5024073123931885 l2_loss : 0.6694033741950989 total_loss : 4.171810626983643 learning_rate : 0.10320703685283661
DLL 2020-09-01 06:41:13.103979 - (3, 16140) imgs_per_sec : 1145.8685178564758 cross_entropy : 3.761662721633911 l2_loss : 0.6693999767303467 total_loss : 4.431062698364258 learning_rate : 0.10321342945098877
DLL 2020-09-01 06:41:13.328088 - (3, 16141) imgs_per_sec : 1144.3761533108027 cross_entropy : 3.6656885147094727 l2_loss : 0.6693966388702393 total_loss : 4.335084915161133 learning_rate : 0.10321982204914093
DLL 2020-09-01 06:41:13.551717 - (3, 16142) imgs_per_sec : 1146.9003003593202 cross_entropy : 3.7244443893432617 l2_loss : 0.6693927049636841 total_loss : 4.393836975097656 learning_rate : 0.10322622209787369
DLL 2020-09-01 06:41:13.775232 - (3, 16143) imgs_per_sec : 1147.4506087545672 cross_entropy : 3.568690061569214 l2_loss : 0.6693896651268005 total_loss : 4.23807954788208 learning_rate : 0.10323261469602585
DLL 2020-09-01 06:41:13.998302 - (3, 16144) imgs_per_sec : 1149.7236080876912 cross_entropy : 3.6850883960723877 l2_loss : 0.6693863272666931 total_loss : 4.3544745445251465 learning_rate : 0.10323900729417801
DLL 2020-09-01 06:41:14.221659 - (3, 16145) imgs_per_sec : 1148.253117016305 cross_entropy : 3.4568936824798584 l2_loss : 0.6693829298019409 total_loss : 4.12627649307251 learning_rate : 0.10324540734291077
DLL 2020-09-01 06:41:14.445667 - (3, 16146) imgs_per_sec : 1144.8996255256716 cross_entropy : 3.873098373413086 l2_loss : 0.6693804264068604 total_loss : 4.542478561401367 learning_rate : 0.10325179994106293
DLL 2020-09-01 06:41:14.669444 - (3, 16147) imgs_per_sec : 1146.1265952495826 cross_entropy : 3.503361940383911 l2_loss : 0.6693772077560425 total_loss : 4.172739028930664 learning_rate : 0.10325819253921509
DLL 2020-09-01 06:41:14.892867 - (3, 16148) imgs_per_sec : 1147.904489342982 cross_entropy : 3.650125026702881 l2_loss : 0.6693738102912903 total_loss : 4.3194990158081055 learning_rate : 0.10326459258794785
DLL 2020-09-01 06:41:15.117992 - (3, 16149) imgs_per_sec : 1139.2220277086371 cross_entropy : 3.5760607719421387 l2_loss : 0.669370710849762 total_loss : 4.245431423187256 learning_rate : 0.1032709851861
DLL 2020-09-01 06:41:15.341059 - (3, 16150) imgs_per_sec : 1149.7580794209136 cross_entropy : 3.720097541809082 l2_loss : 0.6693673729896545 total_loss : 4.389464855194092 learning_rate : 0.10327737778425217
DLL 2020-09-01 06:41:15.564920 - (3, 16151) imgs_per_sec : 1145.6460041590335 cross_entropy : 3.505557060241699 l2_loss : 0.6693630218505859 total_loss : 4.174920082092285 learning_rate : 0.10328377783298492
The speed is around 1140-1150 img/sec
If I disable DALI by dropping the --use_dali parameter, the speed actually increases slightly.
This is the command I used
python main.py --arch=resnet50 --mode=train_and_evaluate --iter_unit=epoch --num_iter=50 --batch_size=256 --warmup_steps=100 --use_cosine --label_smoothing 0.1 --lr_init=0.256
--lr_warmup_epochs=8 --momentum=0.875 --weight_decay=3.0517578125e-05 --use_tf_amp --use_static_loss_scaling --loss_scale 128 --data_dir=/data/source_data/build_imagenet_data-rebuild/final_output/tfrecords/ --data_idx_dir=/data/source_data/build_imagenet_data-rebuild/final_output/dali_idx/ --results_dir=./result1/ --use_xla
This is the log I got
DLL 2020-09-01 06:47:48.206248 - (3, 16294) imgs_per_sec : 1169.2598286633365 cross_entropy : 3.4437801837921143 l2_loss : 0.6690835952758789 total_loss : 4.112863540649414 learning_rate : 0.10419823974370956
DLL 2020-09-01 06:47:48.425326 - (3, 16295) imgs_per_sec : 1170.71827206664 cross_entropy : 3.606600284576416 l2_loss : 0.6690845489501953 total_loss : 4.275684833526611 learning_rate : 0.10420463979244232
DLL 2020-09-01 06:47:48.646631 - (3, 16296) imgs_per_sec : 1158.8988720144973 cross_entropy : 3.8573403358459473 l2_loss : 0.669084370136261 total_loss : 4.526424884796143 learning_rate : 0.10421103239059448
DLL 2020-09-01 06:47:48.865526 - (3, 16297) imgs_per_sec : 1171.764621660377 cross_entropy : 3.569573163986206 l2_loss : 0.6690830588340759 total_loss : 4.238656044006348 learning_rate : 0.10421742498874664
DLL 2020-09-01 06:47:49.084512 - (3, 16298) imgs_per_sec : 1171.2175781599392 cross_entropy : 3.657599687576294 l2_loss : 0.6690821647644043 total_loss : 4.326682090759277 learning_rate : 0.1042238250374794
DLL 2020-09-01 06:47:49.306482 - (3, 16299) imgs_per_sec : 1156.3167131171197 cross_entropy : 3.65336537361145 l2_loss : 0.6690803170204163 total_loss : 4.322445869445801 learning_rate : 0.10423021763563156
DLL 2020-09-01 06:47:49.527442 - (3, 16300) imgs_per_sec : 1160.8082465310624 cross_entropy : 3.7053897380828857 l2_loss : 0.6690793037414551 total_loss : 4.374468803405762 learning_rate : 0.10423661023378372
DLL 2020-09-01 06:47:49.747278 - (3, 16301) imgs_per_sec : 1166.6693728893854 cross_entropy : 3.790090560913086 l2_loss : 0.66907799243927 total_loss : 4.459168434143066 learning_rate : 0.10424301028251648
DLL 2020-09-01 06:47:49.969285 - (3, 16302) imgs_per_sec : 1155.2530106741349 cross_entropy : 3.6587588787078857 l2_loss : 0.6690763831138611 total_loss : 4.3278350830078125 learning_rate : 0.10424940288066864
DLL 2020-09-01 06:47:50.189157 - (3, 16303) imgs_per_sec : 1166.526147224054 cross_entropy : 3.611858606338501 l2_loss : 0.6690753698348999 total_loss : 4.280933856964111 learning_rate : 0.1042557954788208
DLL 2020-09-01 06:47:50.408223 - (3, 16304) imgs_per_sec : 1170.7999707774825 cross_entropy : 3.7367801666259766 l2_loss : 0.6690750122070312 total_loss : 4.405855178833008 learning_rate : 0.10426218807697296
DLL 2020-09-01 06:47:50.627595 - (3, 16305) imgs_per_sec : 1169.183437158702 cross_entropy : 3.586914300918579 l2_loss : 0.6690731644630432 total_loss : 4.255987644195557 learning_rate : 0.10426858812570572
DLL 2020-09-01 06:47:50.847042 - (3, 16306) imgs_per_sec : 1168.7469239544057 cross_entropy : 3.79050350189209 l2_loss : 0.6690713763237 total_loss : 4.4595746994018555 learning_rate : 0.10427498072385788
The speed is around 1160-1170 img/sec
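To put a number on the difference instead of eyeballing the logs, something like the following could be used. It is only a minimal sketch: the helper name `mean_throughput` is made up for illustration, and the sample text is three lines copied from each run above with the non-throughput fields trimmed.

```python
import re
from statistics import mean

# Matches the throughput field of the DLL log lines shown above.
IMGS_PER_SEC = re.compile(r"imgs_per_sec\s*:\s*([0-9.]+)")

def mean_throughput(log_text):
    """Return the mean imgs_per_sec over all DLL lines in log_text."""
    values = [float(v) for v in IMGS_PER_SEC.findall(log_text)]
    if not values:
        raise ValueError("no imgs_per_sec entries found")
    return mean(values)

# Three lines excerpted from each run (other fields trimmed for brevity).
with_dali = """
DLL 2020-09-01 06:41:11.984949 - (3, 16135) imgs_per_sec : 1146.8574254443538
DLL 2020-09-01 06:41:12.208649 - (3, 16136) imgs_per_sec : 1146.482712670826
DLL 2020-09-01 06:41:12.431754 - (3, 16137) imgs_per_sec : 1149.5426674945934
"""
without_dali = """
DLL 2020-09-01 06:47:48.206248 - (3, 16294) imgs_per_sec : 1169.2598286633365
DLL 2020-09-01 06:47:48.425326 - (3, 16295) imgs_per_sec : 1170.71827206664
DLL 2020-09-01 06:47:48.646631 - (3, 16296) imgs_per_sec : 1158.8988720144973
"""

dali = mean_throughput(with_dali)
plain = mean_throughput(without_dali)
# Relative difference of the run without DALI versus the run with DALI.
print(f"with DALI: {dali:.1f} img/s, without DALI: {plain:.1f} img/s, "
      f"difference: {100 * (plain - dali) / dali:.1f}%")
```

Feeding the full log files through the same function would give a steadier average than a three-line excerpt.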
I tried tweaking the number of threads, with no luck. Am I missing anything? Thanks.
Hi,
It is possible that, in your single-GPU configuration, the CPU is powerful enough to keep the GPU busy all the time. In that case, using DALI with a GPU pipeline just puts even more work on the GPU and slows things down.
You can read more about determining whether your training is CPU bottlenecked in this blog post.
This is the blog post that @JanuszL wanted to mention: https://developer.nvidia.com/blog/case-study-resnet50-dali/
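One rough way to apply this reasoning, sketched here under assumptions rather than taken from the blog post: compare the throughput with the real input pipeline against the throughput of the same model fed synthetic data (the GPU-only ceiling). The function name `likely_bottleneck`, the 5% tolerance, and the 1180 img/s synthetic figure are all hypothetical; only the roughly 1165 img/s figure comes from the logs above.

```python
def likely_bottleneck(train_ips, synthetic_ips, tolerance=0.05):
    """Guess what limits training throughput.

    train_ips:     img/s measured with the real input pipeline.
    synthetic_ips: img/s measured with synthetic input, i.e. with no
                   data loading or preprocessing cost.
    tolerance:     relative gap below which the two count as equal
                   (an arbitrary illustrative threshold).
    """
    if synthetic_ips <= 0:
        raise ValueError("synthetic_ips must be positive")
    gap = (synthetic_ips - train_ips) / synthetic_ips
    # A large gap means the GPU waits for data; a small gap means the GPU
    # is already saturated, so faster preprocessing cannot help much.
    return "input pipeline" if gap > tolerance else "GPU compute"

# 1165 img/s is roughly the non-DALI run above; 1180 img/s is a
# hypothetical synthetic-data measurement, not a number from this thread.
print(likely_bottleneck(1165.0, 1180.0))
# A hypothetical CPU-starved case for contrast.
print(likely_bottleneck(600.0, 1180.0))
```

If the first case applies, offloading preprocessing to the GPU competes with the model for the same resource, which would be consistent with the small slowdown reported above.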
Thanks @JanuszL and @klecki. I think that is a reasonable explanation. I will dig into the blog post for more.