DALI benchmark on RN50 is not good?

Created on 1 Sep 2020 · 3 Comments · Source: NVIDIA/DALI

Hi guys,
DALI is amazing in the MXNet RN50 example. Thanks for the good work.
I tried to test it with the NVIDIA TensorFlow example but did not observe the speedup I expected. I am not sure if anything is wrong. Here is the setup I used:
1) https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/Classification/ConvNets/resnet50v1.5
2) 20.03-tf1-py3 ngc docker image
3) 1 V100 32GB PCIE card

This is the command I used

python main.py --arch=resnet50 --mode=train_and_evaluate --iter_unit=epoch --num_iter=50 --batch_size=256 --warmup_steps=100 --use_cosine --label_smoothing 0.1 --lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=3.0517578125e-05 --use_tf_amp --use_static_loss_scaling --loss_scale 128 --data_dir=/data/source_data/build_imagenet_data-rebuild/final_output/tfrecords/ --data_idx_dir=/data/source_data/build_imagenet_data-rebuild/final_output/dali_idx/ --results_dir=./result1/ --use_xla --use_dali

This is the log I got

DLL 2020-09-01 06:41:11.984949 - (3, 16135) imgs_per_sec : 1146.8574254443538  cross_entropy : 3.4489827156066895  l2_loss : 0.6694231629371643  total_loss : 4.118405818939209  learning_rate : 0.10318145900964737
DLL 2020-09-01 06:41:12.208649 - (3, 16136) imgs_per_sec : 1146.482712670826  cross_entropy : 3.6176087856292725  l2_loss : 0.6694180369377136  total_loss : 4.287026882171631  learning_rate : 0.10318785160779953
DLL 2020-09-01 06:41:12.431754 - (3, 16137) imgs_per_sec : 1149.5426674945934  cross_entropy : 3.561917781829834  l2_loss : 0.6694126725196838  total_loss : 4.231330394744873  learning_rate : 0.10319424420595169
DLL 2020-09-01 06:41:12.655259 - (3, 16138) imgs_per_sec : 1147.477586212424  cross_entropy : 3.91888165473938  l2_loss : 0.6694076061248779  total_loss : 4.588289260864258  learning_rate : 0.10320064425468445
DLL 2020-09-01 06:41:12.880025 - (3, 16139) imgs_per_sec : 1141.1483137445784  cross_entropy : 3.5024073123931885  l2_loss : 0.6694033741950989  total_loss : 4.171810626983643  learning_rate : 0.10320703685283661
DLL 2020-09-01 06:41:13.103979 - (3, 16140) imgs_per_sec : 1145.8685178564758  cross_entropy : 3.761662721633911  l2_loss : 0.6693999767303467  total_loss : 4.431062698364258  learning_rate : 0.10321342945098877
DLL 2020-09-01 06:41:13.328088 - (3, 16141) imgs_per_sec : 1144.3761533108027  cross_entropy : 3.6656885147094727  l2_loss : 0.6693966388702393  total_loss : 4.335084915161133  learning_rate : 0.10321982204914093
DLL 2020-09-01 06:41:13.551717 - (3, 16142) imgs_per_sec : 1146.9003003593202  cross_entropy : 3.7244443893432617  l2_loss : 0.6693927049636841  total_loss : 4.393836975097656  learning_rate : 0.10322622209787369
DLL 2020-09-01 06:41:13.775232 - (3, 16143) imgs_per_sec : 1147.4506087545672  cross_entropy : 3.568690061569214  l2_loss : 0.6693896651268005  total_loss : 4.23807954788208  learning_rate : 0.10323261469602585
DLL 2020-09-01 06:41:13.998302 - (3, 16144) imgs_per_sec : 1149.7236080876912  cross_entropy : 3.6850883960723877  l2_loss : 0.6693863272666931  total_loss : 4.3544745445251465  learning_rate : 0.10323900729417801
DLL 2020-09-01 06:41:14.221659 - (3, 16145) imgs_per_sec : 1148.253117016305  cross_entropy : 3.4568936824798584  l2_loss : 0.6693829298019409  total_loss : 4.12627649307251  learning_rate : 0.10324540734291077
DLL 2020-09-01 06:41:14.445667 - (3, 16146) imgs_per_sec : 1144.8996255256716  cross_entropy : 3.873098373413086  l2_loss : 0.6693804264068604  total_loss : 4.542478561401367  learning_rate : 0.10325179994106293
DLL 2020-09-01 06:41:14.669444 - (3, 16147) imgs_per_sec : 1146.1265952495826  cross_entropy : 3.503361940383911  l2_loss : 0.6693772077560425  total_loss : 4.172739028930664  learning_rate : 0.10325819253921509
DLL 2020-09-01 06:41:14.892867 - (3, 16148) imgs_per_sec : 1147.904489342982  cross_entropy : 3.650125026702881  l2_loss : 0.6693738102912903  total_loss : 4.3194990158081055  learning_rate : 0.10326459258794785
DLL 2020-09-01 06:41:15.117992 - (3, 16149) imgs_per_sec : 1139.2220277086371  cross_entropy : 3.5760607719421387  l2_loss : 0.669370710849762  total_loss : 4.245431423187256  learning_rate : 0.1032709851861
DLL 2020-09-01 06:41:15.341059 - (3, 16150) imgs_per_sec : 1149.7580794209136  cross_entropy : 3.720097541809082  l2_loss : 0.6693673729896545  total_loss : 4.389464855194092  learning_rate : 0.10327737778425217
DLL 2020-09-01 06:41:15.564920 - (3, 16151) imgs_per_sec : 1145.6460041590335  cross_entropy : 3.505557060241699  l2_loss : 0.6693630218505859  total_loss : 4.174920082092285  learning_rate : 0.10328377783298492

The speed is around 1140-1150 img/sec

If I disable DALI by dropping the --use_dali parameter, the speed actually increases slightly.
This is the command I used

python main.py --arch=resnet50 --mode=train_and_evaluate --iter_unit=epoch --num_iter=50 --batch_size=256 --warmup_steps=100 --use_cosine --label_smoothing 0.1 --lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=3.0517578125e-05 --use_tf_amp --use_static_loss_scaling --loss_scale 128 --data_dir=/data/source_data/build_imagenet_data-rebuild/final_output/tfrecords/ --data_idx_dir=/data/source_data/build_imagenet_data-rebuild/final_output/dali_idx/ --results_dir=./result1/ --use_xla

This is the log I got

DLL 2020-09-01 06:47:48.206248 - (3, 16294) imgs_per_sec : 1169.2598286633365  cross_entropy : 3.4437801837921143  l2_loss : 0.6690835952758789  total_loss : 4.112863540649414  learning_rate : 0.10419823974370956
DLL 2020-09-01 06:47:48.425326 - (3, 16295) imgs_per_sec : 1170.71827206664  cross_entropy : 3.606600284576416  l2_loss : 0.6690845489501953  total_loss : 4.275684833526611  learning_rate : 0.10420463979244232
DLL 2020-09-01 06:47:48.646631 - (3, 16296) imgs_per_sec : 1158.8988720144973  cross_entropy : 3.8573403358459473  l2_loss : 0.669084370136261  total_loss : 4.526424884796143  learning_rate : 0.10421103239059448
DLL 2020-09-01 06:47:48.865526 - (3, 16297) imgs_per_sec : 1171.764621660377  cross_entropy : 3.569573163986206  l2_loss : 0.6690830588340759  total_loss : 4.238656044006348  learning_rate : 0.10421742498874664
DLL 2020-09-01 06:47:49.084512 - (3, 16298) imgs_per_sec : 1171.2175781599392  cross_entropy : 3.657599687576294  l2_loss : 0.6690821647644043  total_loss : 4.326682090759277  learning_rate : 0.1042238250374794
DLL 2020-09-01 06:47:49.306482 - (3, 16299) imgs_per_sec : 1156.3167131171197  cross_entropy : 3.65336537361145  l2_loss : 0.6690803170204163  total_loss : 4.322445869445801  learning_rate : 0.10423021763563156
DLL 2020-09-01 06:47:49.527442 - (3, 16300) imgs_per_sec : 1160.8082465310624  cross_entropy : 3.7053897380828857  l2_loss : 0.6690793037414551  total_loss : 4.374468803405762  learning_rate : 0.10423661023378372
DLL 2020-09-01 06:47:49.747278 - (3, 16301) imgs_per_sec : 1166.6693728893854  cross_entropy : 3.790090560913086  l2_loss : 0.66907799243927  total_loss : 4.459168434143066  learning_rate : 0.10424301028251648
DLL 2020-09-01 06:47:49.969285 - (3, 16302) imgs_per_sec : 1155.2530106741349  cross_entropy : 3.6587588787078857  l2_loss : 0.6690763831138611  total_loss : 4.3278350830078125  learning_rate : 0.10424940288066864
DLL 2020-09-01 06:47:50.189157 - (3, 16303) imgs_per_sec : 1166.526147224054  cross_entropy : 3.611858606338501  l2_loss : 0.6690753698348999  total_loss : 4.280933856964111  learning_rate : 0.1042557954788208
DLL 2020-09-01 06:47:50.408223 - (3, 16304) imgs_per_sec : 1170.7999707774825  cross_entropy : 3.7367801666259766  l2_loss : 0.6690750122070312  total_loss : 4.405855178833008  learning_rate : 0.10426218807697296
DLL 2020-09-01 06:47:50.627595 - (3, 16305) imgs_per_sec : 1169.183437158702  cross_entropy : 3.586914300918579  l2_loss : 0.6690731644630432  total_loss : 4.255987644195557  learning_rate : 0.10426858812570572
DLL 2020-09-01 06:47:50.847042 - (3, 16306) imgs_per_sec : 1168.7469239544057  cross_entropy : 3.79050350189209  l2_loss : 0.6690713763237  total_loss : 4.4595746994018555  learning_rate : 0.10427498072385788

The speed is around 1160-1170 img/sec
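To put a number on the gap between the two runs, the imgs_per_sec field can be averaged straight from the DLL log lines. The snippet below is only a rough sketch: the helper name `mean_throughput` is made up for illustration, and the sample strings are two lines from each log above, cut off after the throughput field.

```python
import re
from statistics import mean

# Matches the throughput field in DLL log lines like the ones above.
IMGS_PER_SEC = re.compile(r"imgs_per_sec\s*:\s*([0-9.]+)")


def mean_throughput(log_text):
    """Return the mean imgs_per_sec over all DLL lines in log_text.

    Hypothetical helper for illustration; not part of the example repo.
    """
    values = [float(m.group(1)) for m in IMGS_PER_SEC.finditer(log_text)]
    if not values:
        raise ValueError("no imgs_per_sec entries found")
    return mean(values)


# Two lines per run, copied from the logs above and truncated
# after the throughput field.
dali_log = """
DLL 2020-09-01 06:41:11.984949 - (3, 16135) imgs_per_sec : 1146.8574254443538
DLL 2020-09-01 06:41:12.208649 - (3, 16136) imgs_per_sec : 1146.482712670826
"""
native_log = """
DLL 2020-09-01 06:47:48.206248 - (3, 16294) imgs_per_sec : 1169.2598286633365
DLL 2020-09-01 06:47:48.425326 - (3, 16295) imgs_per_sec : 1170.71827206664
"""

dali = mean_throughput(dali_log)
native = mean_throughput(native_log)

# Relative slowdown of the DALI run, as a percentage of the non-DALI run.
relative_gap = (native - dali) / native * 100.0
print(f"with DALI: {dali:.1f} img/s, without: {native:.1f} img/s, gap: {relative_gap:.1f}%")
```

Feeding the full logs to the same function gives a steadier estimate than eyeballing a handful of lines.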

I tried tweaking the number of threads, with no luck. Am I missing anything? Thanks.

question

All 3 comments

Hi,
It is possible that, in your single-GPU configuration, the CPU is powerful enough to keep the GPU busy all the time. In that case, using DALI with a GPU pipeline just puts even more work on the GPU and slows things down.
You can read more about determining whether your training is CPU-bottlenecked in this blog post.

This is the blog post that @JanuszL wanted to mention: https://developer.nvidia.com/blog/case-study-resnet50-dali/
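One rough way to check for the CPU bottleneck mentioned above is to time the input pipeline on its own and compare it with the training throughput. The sketch below is an assumption-laden illustration, not the method from the blog post: `images_per_second` and `fake_loader` are hypothetical names, and the fake loader only stands in for a real data iterator.

```python
import time


def images_per_second(batches, batch_size, num_batches):
    """Pull num_batches batches from an iterable and return images/sec.

    Hypothetical helper: it times only data loading, with no model step.
    """
    it = iter(batches)
    next(it)  # warm-up batch, excluded from the timing
    start = time.perf_counter()
    for _ in range(num_batches):
        next(it)
    elapsed = time.perf_counter() - start
    return batch_size * num_batches / elapsed


def fake_loader(batch_size, delay_s):
    """Stand-in for a real input pipeline (assumption for this sketch)."""
    while True:
        time.sleep(delay_s)  # simulates decode/augment time per batch
        yield [0] * batch_size


loader_rate = images_per_second(fake_loader(256, 0.01), batch_size=256, num_batches=20)
training_rate = 1150.0  # approximate img/s from the training logs in the question

if loader_rate > training_rate:
    print(f"loader alone: {loader_rate:.0f} img/s; input pipeline is probably not the bottleneck")
else:
    print(f"loader alone: {loader_rate:.0f} img/s; training may be input-bound")
```

If the loader alone is much faster than the training loop, moving preprocessing to the GPU is unlikely to help, which would match the numbers reported in the question.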

Thanks @JanuszL and @klecki. I think that is a reasonable explanation. I will dig into the blog post for more.
