tf serving version:
TensorFlow ModelServer: 0.0.0+dev.sha.718ad43
TensorFlow Library: 1.12.0
model: resnet_v2_fp32_savedmodel_NHWC.tar.gz ResNet-50
1: "--config=mkl --config=nativeopt" avg latency: 184.8766 ms
2: "--config=nativeopt" avg latency: 67.2201 ms
3: "--config=mkl" avg latency: 184.8249 ms
What can I do about this? Why does inference latency become longer when the server is built with mkl?
Could you describe the issue clearly? A clear explanation of the steps you followed and the exact commands used (if applicable) would really help us look into the issue.
Hi @hgadig, thanks for your advice.
I have just started learning TensorFlow. My goal is to run TensorFlow Serving on an Intel CPU, so I built tensorflow_model_server from source with the CPU optimization configs (mkl and nativeopt):
1. tools/run_in_docker.sh bazel build --color=yes --curses=yes --config=mkl --config=nativeopt --verbose_failures tensorflow_serving/model_servers:tensorflow_model_server
2. tools/run_in_docker.sh bazel build --color=yes --curses=yes --config=nativeopt --verbose_failures tensorflow_serving/model_servers:tensorflow_model_server
3. tools/run_in_docker.sh bazel build --color=yes --curses=yes --config=mkl --verbose_failures tensorflow_serving/model_servers:tensorflow_model_server
After each build completes, I copy the tensorflow_model_server binary (plus libiomp5.so, libmklml_gnu.so, and libmklml_intel.so when building with mkl) into a Docker image. I deploy the pods in Kubernetes with the same quota (cpu: 8 cores, mem: 8G) and start the server with:
tensorflow_model_server --port=8500 --rest_api_port=8501 --tensorflow_intra_op_parallelism=8 --tensorflow_inter_op_parallelism=8 --model_name=resnet --model_base_path=/models/resnet
Once the server is up, I send requests (a cat.jpg) to it with a REST client.
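For anyone trying to reproduce the measurement, here is a minimal sketch of such a client. It assumes the REST port 8501 and model name resnet from the command above, and that the model's serving signature accepts a base64-encoded JPEG under a "b64" key. The helper names, the localhost URL, and the warmup and run counts are hypothetical and are not taken from the original test.

```python
import base64
import json
import time
import urllib.request

# Assumed endpoint: REST port 8501 and model name "resnet" from the server command.
SERVER_URL = "http://localhost:8501/v1/models/resnet:predict"


def build_predict_body(image_bytes):
    """Encode one image as a TF Serving REST predict request body (batch size 1)."""
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return json.dumps({"instances": [{"b64": b64}]}).encode("utf-8")


def post_predict(body, url=SERVER_URL):
    """Send one predict request and return the parsed JSON response."""
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


def average_latency_ms(send, body, warmup=3, runs=10):
    """Average wall-clock time of send(body) over `runs` calls, after `warmup` untimed calls."""
    for _ in range(warmup):
        send(body)
    start = time.perf_counter()
    for _ in range(runs):
        send(body)
    return (time.perf_counter() - start) * 1000.0 / runs


def main(image_path="cat.jpg"):
    """Measure the average latency for one image against a running server."""
    with open(image_path, "rb") as f:
        body = build_predict_body(f.read())
    print("avg latency: %.4f ms" % average_latency_ms(post_predict, body))
```

Calling main() against each of the three builds, with the same image and the same pod quota, gives directly comparable averages. The warmup calls keep one-time initialization out of the timing.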
request result:
I expected the mkl option to speed up inference, but the result is the opposite. I searched Google extensively but could not solve the problem. I noticed that when I use mkl with MKLDNN_VERBOSE=1, the pod outputs a lot of logs:
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_blocked out:f32_blocked,num:1,224x224x3,0.0720215
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_nhwc out:f32_nchw,num:1,1x3x224x224,0.269775
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_Ohwi8o,num:1,64x3x7x7,0.0148926
mkldnn_verbose,exec,convolution,jit:avx2,forward_training,fsrc:nchw fwei:Ohwi8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic3oc64_ih224oh112kh7sh2dh0ph3_iw224ow112kw7sw2dw0pw3,4.63403
mkldnn_verbose,exec,pooling,jit:avx,forward_training,fdata:nChw8c fws:nChw8c,alg:pooling_max,mb1ic64_ih112oh56kh3sh2ph0_iw112ow56kw3sw2pw0,6.73706
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic64ih56iw56,0.13208
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic64ih56iw56,0.13208
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,64x64x1x1,0.00805664
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,256x64x1x1,0.0219727
mkldnn_verbose,exec,convolution,jit_1x1:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic64oc64_ih56oh56kh1sh1dh0ph0_iw56ow56kw1sw1dw0pw0,0.520996
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic64ih56iw56,0.100098
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic64ih56iw56,0.156982
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,64x64x3x3,0.046875
mkldnn_verbose,exec,convolution,jit_1x1:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic64oc256_ih56oh56kh1sh1dh0ph0_iw56ow56kw1sw1dw0pw0,1.71704
mkldnn_verbose,exec,convolution,jit:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic64oc64_ih56oh56kh3sh1dh0ph1_iw56ow56kw3sw1dw0pw1,4.39502
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic64ih56iw56,0.0849609
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic64ih56iw56,0.130127
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,256x64x1x1,0.0161133
mkldnn_verbose,exec,convolution,jit_1x1:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic64oc256_ih56oh56kh1sh1dh0ph0_iw56ow56kw1sw1dw0pw0,2.07715
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic256ih56iw56,0.447998
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic256ih56iw56,0.572998
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,64x256x1x1,0.0280762
mkldnn_verbose,exec,convolution,jit_1x1:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic256oc64_ih56oh56kh1sh1dh0ph0_iw56ow56kw1sw1dw0pw0,2.03296
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic64ih56iw56,0.12207
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic64ih56iw56,0.123047
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,64x64x3x3,0.0371094
mkldnn_verbose,exec,convolution,jit:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic64oc64_ih56oh56kh3sh1dh0ph1_iw56ow56kw3sw1dw0pw1,4.16797
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic64ih56iw56,0.110107
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic64ih56iw56,0.122803
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,256x64x1x1,0.0161133
mkldnn_verbose,exec,convolution,jit_1x1:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic64oc256_ih56oh56kh1sh1dh0ph0_iw56ow56kw1sw1dw0pw0,1.90698
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic256ih56iw56,0.447998
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic256ih56iw56,0.519043
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,64x256x1x1,0.0249023
mkldnn_verbose,exec,convolution,jit_1x1:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic256oc64_ih56oh56kh1sh1dh0ph0_iw56ow56kw1sw1dw0pw0,1.927
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic64ih56iw56,0.107178
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic64ih56iw56,0.121826
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,64x64x3x3,0.0358887
mkldnn_verbose,exec,convolution,jit:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic64oc64_ih56oh56kh3sh1dh0ph1_iw56ow56kw3sw1dw0pw1,3.94092
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic64ih56iw56,0.11792
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic64ih56iw56,0.12207
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,256x64x1x1,0.0158691
mkldnn_verbose,exec,convolution,jit_1x1:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic64oc256_ih56oh56kh1sh1dh0ph0_iw56ow56kw1sw1dw0pw0,1.80786
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic256ih56iw56,0.429932
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic256ih56iw56,0.564941
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,128x256x1x1,0.0490723
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,512x256x1x1,0.219971
mkldnn_verbose,exec,convolution,jit_1x1:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic256oc512_ih56oh28kh1sh2dh0ph0_iw56ow28kw1sw2dw0pw0,3.58984
mkldnn_verbose,exec,convolution,jit_1x1:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic256oc128_ih56oh56kh1sh1dh0ph0_iw56ow56kw1sw1dw0pw0,3.93701
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic128ih56iw56,0.189209
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic128ih56iw56,0.24585
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,128x128x3x3,0.229004
mkldnn_verbose,exec,convolution,jit:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic128oc128_ih56oh28kh3sh2dh0ph1_iw56ow28kw3sw2dw0pw1,4.39185
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic128ih28iw28,0.0539551
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic128ih28iw28,0.0600586
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,512x128x1x1,0.078125
mkldnn_verbose,exec,convolution,jit_1x1:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic128oc512_ih28oh28kh1sh1dh0ph0_iw28ow28kw1sw1dw0pw0,1.68311
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic512ih28iw28,0.238037
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic512ih28iw28,0.243164
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,128x512x1x1,0.0751953
mkldnn_verbose,exec,convolution,jit_1x1:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic512oc128_ih28oh28kh1sh1dh0ph0_iw28ow28kw1sw1dw0pw0,1.78296
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic128ih28iw28,0.0539551
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic128ih28iw28,0.0610352
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,128x128x3x3,0.170898
mkldnn_verbose,exec,convolution,jit:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic128oc128_ih28oh28kh3sh1dh0ph1_iw28ow28kw3sw1dw0pw1,3.85205
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic128ih28iw28,0.0539551
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic128ih28iw28,0.0639648
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,512x128x1x1,0.0700684
mkldnn_verbose,exec,convolution,jit_1x1:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic128oc512_ih28oh28kh1sh1dh0ph0_iw28ow28kw1sw1dw0pw0,1.49487
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic512ih28iw28,0.208008
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic512ih28iw28,0.259033
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,128x512x1x1,0.0710449
mkldnn_verbose,exec,convolution,jit_1x1:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic512oc128_ih28oh28kh1sh1dh0ph0_iw28ow28kw1sw1dw0pw0,1.52783
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic128ih28iw28,0.0500488
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic128ih28iw28,0.0559082
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,128x128x3x3,0.169922
mkldnn_verbose,exec,convolution,jit:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic128oc128_ih28oh28kh3sh1dh0ph1_iw28ow28kw3sw1dw0pw1,3.27295
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic128ih28iw28,0.0500488
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic128ih28iw28,0.0568848
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,512x128x1x1,0.064209
mkldnn_verbose,exec,convolution,jit_1x1:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic128oc512_ih28oh28kh1sh1dh0ph0_iw28ow28kw1sw1dw0pw0,1.42407
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic512ih28iw28,0.201904
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic512ih28iw28,0.24292
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,128x512x1x1,0.0651855
mkldnn_verbose,exec,convolution,jit_1x1:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic512oc128_ih28oh28kh1sh1dh0ph0_iw28ow28kw1sw1dw0pw0,1.55005
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic128ih28iw28,0.0498047
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic128ih28iw28,0.0568848
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,128x128x3x3,0.1521
mkldnn_verbose,exec,convolution,jit:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic128oc128_ih28oh28kh3sh1dh0ph1_iw28ow28kw3sw1dw0pw1,3.25903
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic128ih28iw28,0.0498047
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic128ih28iw28,0.0571289
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,512x128x1x1,0.0649414
mkldnn_verbose,exec,convolution,jit_1x1:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic128oc512_ih28oh28kh1sh1dh0ph0_iw28ow28kw1sw1dw0pw0,1.40894
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic512ih28iw28,0.199951
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic512ih28iw28,0.231934
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,256x512x1x1,0.1521
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,1024x512x1x1,0.943848
mkldnn_verbose,exec,convolution,jit_1x1:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic512oc256_ih28oh28kh1sh1dh0ph0_iw28ow28kw1sw1dw0pw0,2.91895
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic256ih28iw28,0.101074
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic256ih28iw28,0.13501
mkldnn_verbose,exec,convolution,jit_1x1:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic512oc1024_ih28oh14kh1sh2dh0ph0_iw28ow14kw1sw2dw0pw0,3.16309
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,256x256x3x3,0.89502
mkldnn_verbose,exec,convolution,jit:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic256oc256_ih28oh14kh3sh2dh0ph1_iw28ow14kw3sw2dw0pw1,3.64014
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic256ih14iw14,0.027832
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic256ih14iw14,0.0339355
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,1024x256x1x1,0.400879
mkldnn_verbose,exec,convolution,jit_1x1:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic256oc1024_ih14oh14kh1sh1dh0ph0_iw14ow14kw1sw1dw0pw0,1.42798
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic1024ih14iw14,0.106201
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic1024ih14iw14,0.134033
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,256x1024x1x1,0.296143
mkldnn_verbose,exec,convolution,jit_1x1:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic1024oc256_ih14oh14kh1sh1dh0ph0_iw14ow14kw1sw1dw0pw0,1.44702
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic256ih14iw14,0.0258789
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic256ih14iw14,0.0361328
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,256x256x3x3,0.726807
mkldnn_verbose,exec,convolution,jit:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic256oc256_ih14oh14kh3sh1dh0ph1_iw14ow14kw3sw1dw0pw1,3.52002
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic256ih14iw14,0.026123
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic256ih14iw14,0.0280762
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,1024x256x1x1,0.299072
mkldnn_verbose,exec,convolution,jit_1x1:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic256oc1024_ih14oh14kh1sh1dh0ph0_iw14ow14kw1sw1dw0pw0,1.42407
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic1024ih14iw14,0.115967
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic1024ih14iw14,0.12085
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,256x1024x1x1,0.280029
mkldnn_verbose,exec,convolution,jit_1x1:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic1024oc256_ih14oh14kh1sh1dh0ph0_iw14ow14kw1sw1dw0pw0,1.46411
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic256ih14iw14,0.0249023
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic256ih14iw14,0.0290527
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,256x256x3x3,0.728027
mkldnn_verbose,exec,convolution,jit:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic256oc256_ih14oh14kh3sh1dh0ph1_iw14ow14kw3sw1dw0pw1,3.55688
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic256ih14iw14,0.0319824
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic256ih14iw14,0.0288086
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,1024x256x1x1,0.287109
mkldnn_verbose,exec,convolution,jit_1x1:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic256oc1024_ih14oh14kh1sh1dh0ph0_iw14ow14kw1sw1dw0pw0,1.41504
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic1024ih14iw14,0.125
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic1024ih14iw14,0.11499
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,256x1024x1x1,0.273926
mkldnn_verbose,exec,convolution,jit_1x1:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic1024oc256_ih14oh14kh1sh1dh0ph0_iw14ow14kw1sw1dw0pw0,1.44897
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic256ih14iw14,0.026123
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic256ih14iw14,0.0288086
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,256x256x3x3,0.704102
mkldnn_verbose,exec,convolution,jit:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic256oc256_ih14oh14kh3sh1dh0ph1_iw14ow14kw3sw1dw0pw1,3.4751
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic256ih14iw14,0.0258789
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic256ih14iw14,0.027832
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,1024x256x1x1,0.287109
mkldnn_verbose,exec,convolution,jit_1x1:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic256oc1024_ih14oh14kh1sh1dh0ph0_iw14ow14kw1sw1dw0pw0,1.43506
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic1024ih14iw14,0.11499
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic1024ih14iw14,0.114014
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,256x1024x1x1,0.322021
mkldnn_verbose,exec,convolution,jit_1x1:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic1024oc256_ih14oh14kh1sh1dh0ph0_iw14ow14kw1sw1dw0pw0,1.48804
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic256ih14iw14,0.0251465
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic256ih14iw14,0.0280762
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,256x256x3x3,0.673828
mkldnn_verbose,exec,convolution,jit:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic256oc256_ih14oh14kh3sh1dh0ph1_iw14ow14kw3sw1dw0pw1,3.57983
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic256ih14iw14,0.0249023
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic256ih14iw14,0.0300293
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,1024x256x1x1,0.282959
mkldnn_verbose,exec,convolution,jit_1x1:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic256oc1024_ih14oh14kh1sh1dh0ph0_iw14ow14kw1sw1dw0pw0,1.41992
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic1024ih14iw14,0.105957
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic1024ih14iw14,0.11499
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,256x1024x1x1,0.281006
mkldnn_verbose,exec,convolution,jit_1x1:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic1024oc256_ih14oh14kh1sh1dh0ph0_iw14ow14kw1sw1dw0pw0,1.45117
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic256ih14iw14,0.026123
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic256ih14iw14,0.0341797
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,256x256x3x3,0.624023
mkldnn_verbose,exec,convolution,jit:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic256oc256_ih14oh14kh3sh1dh0ph1_iw14ow14kw3sw1dw0pw1,3.59302
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic256ih14iw14,0.0219727
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic256ih14iw14,0.0290527
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,1024x256x1x1,0.344971
mkldnn_verbose,exec,convolution,jit_1x1:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic256oc1024_ih14oh14kh1sh1dh0ph0_iw14ow14kw1sw1dw0pw0,1.43481
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic1024ih14iw14,0.105957
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic1024ih14iw14,0.115967
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,512x1024x1x1,0.675049
mkldnn_verbose,exec,convolution,jit_1x1:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic1024oc512_ih14oh14kh1sh1dh0ph0_iw14ow14kw1sw1dw0pw0,2.96313
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic512ih14iw14,0.052002
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,2048x1024x1x1,3.70703
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic512ih14iw14,0.0600586
mkldnn_verbose,exec,convolution,jit_1x1:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic1024oc2048_ih14oh7kh1sh2dh0ph0_iw14ow7kw1sw2dw0pw0,3.59106
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,512x512x3x3,4.06909
mkldnn_verbose,exec,convolution,jit:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic512oc512_ih14oh7kh3sh2dh0ph1_iw14ow7kw3sw2dw0pw1,4.45605
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic512ih7iw7,0.0119629
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic512ih7iw7,0.0161133
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,2048x512x1x1,1.95117
mkldnn_verbose,exec,convolution,jit_1x1:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic512oc2048_ih7oh7kh1sh1dh0ph0_iw7ow7kw1sw1dw0pw0,1.6189
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic2048ih7iw7,0.0688477
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic2048ih7iw7,0.0600586
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,512x2048x1x1,1.40894
mkldnn_verbose,exec,convolution,jit_1x1:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic2048oc512_ih7oh7kh1sh1dh0ph0_iw7ow7kw1sw1dw0pw0,1.59009
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic512ih7iw7,0.013916
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic512ih7iw7,0.0151367
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,512x512x3x3,2.85815
mkldnn_verbose,exec,convolution,jit:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic512oc512_ih7oh7kh3sh1dh0ph1_iw7ow7kw3sw1dw0pw1,4.11279
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic512ih7iw7,0.013916
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic512ih7iw7,0.0151367
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,2048x512x1x1,1.16504
mkldnn_verbose,exec,convolution,jit_1x1:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic512oc2048_ih7oh7kh1sh1dh0ph0_iw7ow7kw1sw1dw0pw0,1.53711
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic2048ih7iw7,0.0620117
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic2048ih7iw7,0.0571289
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,512x2048x1x1,1.30005
mkldnn_verbose,exec,convolution,jit_1x1:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic2048oc512_ih7oh7kh1sh1dh0ph0_iw7ow7kw1sw1dw0pw0,1.56519
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic512ih7iw7,0.012207
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic512ih7iw7,0.0151367
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,512x512x3x3,2.95093
mkldnn_verbose,exec,convolution,jit:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic512oc512_ih7oh7kh3sh1dh0ph1_iw7ow7kw3sw1dw0pw1,4.01489
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic512ih7iw7,0.0119629
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic512ih7iw7,0.0148926
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_hwio out:f32_OIhw8i8o,num:1,2048x512x1x1,1.17505
mkldnn_verbose,exec,convolution,jit_1x1:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:undef fdst:nChw8c,alg:convolution_direct,mb1_g1ic512oc2048_ih7oh7kh1sh1dh0ph0_iw7ow7kw1sw1dw0pw0,1.54395
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:3,mb1ic2048ih7iw7,0.0588379
mkldnn_verbose,exec,eltwise,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,alg:eltwise_relu,mb1ic2048ih7iw7,0.072998
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_nChw8c out:f32_nhwc,num:1,1x2048x7x7,0.0410156
mkldnn_verbose,create,softmax,ref:any,forward_inference,fdata:nc fdiff:undef,,mb1ic1001ih0iw1476395008,0.00415039
mkldnn_verbose,exec,softmax,ref:any,forward_inference,fdata:nc fdiff:undef,,mb1ic1001ih0iw1476395008,0.0109863
thanks again!
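A log like the one above is easier to read once the times are totalled per primitive kind. The following is a rough sketch, assuming the last comma-separated field of each line is the execution time in milliseconds and the third field is the primitive kind, as the lines above suggest. The function name is hypothetical.

```python
from collections import defaultdict


def summarize_mkldnn_verbose(lines):
    """Sum the trailing time field (assumed ms) per primitive kind for `exec` lines."""
    totals = defaultdict(float)
    for line in lines:
        fields = line.strip().split(",")
        # Skip anything that is not an executed MKL-DNN primitive (e.g. `create` lines).
        if len(fields) < 4 or fields[0] != "mkldnn_verbose" or fields[1] != "exec":
            continue
        try:
            elapsed_ms = float(fields[-1])
        except ValueError:
            continue  # Malformed or truncated line.
        totals[fields[2]] += elapsed_ms
    return dict(totals)


def print_summary(totals):
    """Print primitive kinds sorted by total time, largest first."""
    for kind, ms in sorted(totals.items(), key=lambda item: item[1], reverse=True):
        print("%-20s %10.3f ms" % (kind, ms))
```

Feeding the pod log for a single request through summarize_mkldnn_verbose shows how much of the request is spent in reorder (layout conversion) primitives compared with convolution itself.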
I encounter the same problem. Using tf_serving built with MKL(--config=mkl) consumes much more time (3x ~ 4x) than without it.
The other build arguments, common to both builds, are:
bazel build -c opt --copt="-DEIGEN_USE_VML" --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-msse4.2 tensorflow_serving/...
CPU: Xeon(R) E5-2630 v4
@karthikvadla any idea?
(sorry for the driveby comment and I ask the experts on this thread to validate my thoughts)
I think the first question is "Why do you expect to see a performance improvement with mkl?" Whether this is to be expected depends not only on your processor, but also on what kind of load your model is under, how variable that load is, how many concurrent sessions you're running, etc.
In addition, if I understand correctly, you're feeding batch sizes of 1, so I wouldn't expect a latency reduction as there isn't much work to vectorize, and the eigen implementation is already pretty optimal.
Another point worth noting is that, as far as I know, mkl uses OpenMP for thread management, which is independent of TF's own thread management. This is fixed in the new TF-hybrid implementation (which essentially gives you the best of mkl without OpenMP and TF stepping on each other's toes) - this is enabled by default here as part of 2.0. Feel free to give it a spin (with larger batch sizes) and report your findings.
Thanks
@unclepeddy
hello,
I am confused by your comment "Whether this is to be expected depends not only on your processor, but also what kind of load your model is under, how variable that load is, how many concurrent sessions you're running, etc."
Does a running session process only one request at a time? If I want to run N concurrent requests, can I make them as fast as a single request by setting "tensorflow_session_parallelism=N"?
By the way, could you explain the parameters OMP_NUM_THREADS, tensorflow_session_parallelism, tensorflow_intra_op_parallelism, and tensorflow_inter_op_parallelism in detail? The latter two cannot be set at the same time in TF 1.13.
Thanks
Yes - a Session.Run call is blocking. In order to process examples concurrently, you have to either batch examples on client side before you send them or enable server-side batching.
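As a rough sketch of client-side batching over the REST API: several images go into one "instances" list, so the server handles them in a single Session.Run call. This assumes the model accepts base64-encoded image bytes under a "b64" key; the helper name build_predict_payload and the placeholder image bytes are made up for illustration.

```python
import base64
import json


def build_predict_payload(images):
    """Pack several raw image byte strings into one REST predict request body."""
    instances = [
        {"b64": base64.b64encode(img).decode("ascii")}  # assumed input encoding
        for img in images
    ]
    return json.dumps({"instances": instances})


# Placeholder bytes stand in for real JPEG file contents.
payload = build_predict_payload([b"fake-jpeg-1", b"fake-jpeg-2"])
print(payload)
```

With the server flags from the original post, the body would be POSTed to http://localhost:8501/v1/models/resnet:predict.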
To enable server-side batching, set --enable_batching and --batching_parameters_file flags. Look at SessionBundleConfig for the batching params that can be defined.
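For reference, a small sketch that renders a batching parameters file in the text format the server reads. The four fields used here (max_batch_size, batch_timeout_micros, num_batch_threads, max_enqueued_batches) are batching parameters in TF Serving; the values are arbitrary placeholders rather than tuned recommendations, and render_batching_parameters is a hypothetical helper.

```python
def render_batching_parameters(max_batch_size, batch_timeout_micros,
                               num_batch_threads, max_enqueued_batches):
    """Render batching parameters as text, one wrapped value per line."""
    fields = [
        ("max_batch_size", max_batch_size),
        ("batch_timeout_micros", batch_timeout_micros),
        ("num_batch_threads", num_batch_threads),
        ("max_enqueued_batches", max_enqueued_batches),
    ]
    return "".join(f"{name} {{ value: {value} }}\n" for name, value in fields)


# Placeholder values for illustration only; tune them for your own workload.
text = render_batching_parameters(
    max_batch_size=8,
    batch_timeout_micros=1000,
    num_batch_threads=8,
    max_enqueued_batches=100,
)
print(text, end="")
```

Save the output to a file and pass its path via --batching_parameters_file together with --enable_batching.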
Regarding your questions about the different flags: before this PR, tensorflow_session_parallelism implicitly set both tensorflow_inter_op_parallelism and tensorflow_intra_op_parallelism, which define the behavior of the TF Session runtime. AFAIK this PR made it into 1.13, so you should now be able to set intra and inter parallelism separately. If this doesn't work in TFS 1.13, I believe that's a bug and should be filed appropriately.
Regarding what these flags do:
tensorflow_inter_op_parallelism limits the maximum number of concurrent TF ops executing
tensorflow_intra_op_parallelism is the size of the global threadpool that _native TF ops_ use to execute their work.
However, as I mentioned in my above comment, MKL ops execution is managed via OpenMP, orthogonally to the above. All ops are launched with a pre-determined # of threads, set by OMP_NUM_THREADS (which I believe in TF-MKL implementation is set to tensorflow_intra_op_parallelism).
This is why I mention that it's not guaranteed that you'll see better performance by using mkl - TF's thread management isn't aware of OpenMP's and so it's easy to cause inadvertent thread migration, cache thrashing, etc.
Hope this helps
Thank you very much for your reply.
I have tested the parameters. With tensorflow_session_parallelism=10 on tfs-cpu (the build without MKL), I can run 8 concurrent requests as fast as a single one.
With MKL enabled, OMP_NUM_THREADS=20 and --tensorflow_intra_op_parallelism=1 --tensorflow_inter_op_parallelism=1 give the best performance for a single request at a time, but the time cost grows as I increase the number of concurrent requests.
With OMP_NUM_THREADS=4 and --tensorflow_intra_op_parallelism=4 --tensorflow_inter_op_parallelism=4, concurrent requests perform better, but a single request is slower:
batch     1   2   4   8   10
time(ms)  16  14  15  26  34
For comparison, the best time for a single request at a time, with OMP_NUM_THREADS=20 and --tensorflow_intra_op_parallelism=1 --tensorflow_inter_op_parallelism=1, is 12 ms.
Is it possible to increase the number of concurrent requests further? I have tried other parameter values, but none helped. The CPU is an Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz, Socket(s): 2, Core(s) per socket: 10. With 10 concurrent requests, only about 20 cores are in use, at 80%~95% utilization, and the time cost increases significantly as I keep raising the concurrency.
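To put the timing table above in throughput terms, here is a quick calculation. It assumes each time(ms) value is how long it takes to finish all N requests (or items) in that column together, so throughput is N divided by the time; that reading of the table is an assumption, not something stated in the comment.

```python
# Values copied from the table above.
concurrency = [1, 2, 4, 8, 10]
latency_ms = [16, 14, 15, 26, 34]


def throughput_per_second(n, ms):
    """Items completed per second if n items finish together in `ms` milliseconds."""
    return n * 1000.0 / ms


rates = [throughput_per_second(n, ms) for n, ms in zip(concurrency, latency_ms)]
for n, ms, rate in zip(concurrency, latency_ms, rates):
    print(f"n={n:2d}  time={ms:2d} ms  throughput={rate:6.1f}/s")
```

Under that reading, throughput keeps rising up to 8 and dips slightly at 10, which would suggest the setup is already close to saturation even though latency is still moderate.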
Longer latency is expected. As @unclepeddy explained, MKL uses OpenMP, whose threading settings are not as dynamic. Intel-optimized TensorFlow (I'll call it TF-MKL from here on) can give better throughput than normal TF with the right settings for a rather static workload, but it can also have much worse latency because OpenMP threading and TF threading don't know about each other. It's also hard to make TF Serving with TF-MKL work well under a dynamic workload because OpenMP's number of threads is set only once, as a global setting. The Intel team is aware of this downside and is looking into making MKL-DNN accept an Eigen threadpool (which is the base of the TF threadpool) as an alternative to OpenMP. (Won't happen in the near future, though.)
More in-depth threading explanation:
Normal TensorFlow has two thread pools: inter-op (how many ops can be run at the same time) and intra-op (the threads the ops will use to execute). Concurrent ops share/split the threads in the intra-op thread pool, so there can be at most #inter + #intra threads at any point in time. TensorFlow can run fewer than #inter ops at the same time, and in practice I usually saw 3-4 (or sometimes 10) ops running at the same time. Setting both #inter and #intra to #cores generally works pretty well.
TF-MKL is different. Each MKL op uses OMP_NUM_THREADS threads. The OpenMP thread pool is separate from the inter-op and intra-op thread pools. So, in the worst case there can be O(OMP_NUM_THREADS * #inter + #intra) threads running. Multiply that by #concurrent_tf_serving_sessions if you run multiple sessions. This can lead to major oversubscription of cores. Therefore, we don't expect TF-MKL to outperform normal TF in TF Serving settings.
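The thread-count estimates above can be turned into a small calculation. This is only a sketch of the upper bounds as described in this comment, not a measurement. The settings plugged in are the 8 cores and inter/intra values of 8 from the original post; OMP_NUM_THREADS=8 is an assumption, since the post does not state it. The function name worst_case_threads is made up.

```python
def worst_case_threads(inter_op, intra_op, omp_num_threads=None, sessions=1):
    """Upper-bound thread estimate, following the formulas in the comment above."""
    if omp_num_threads is None:
        # Normal TF: at most #inter + #intra threads.
        per_session = inter_op + intra_op
    else:
        # TF-MKL: each concurrently running MKL op may use OMP_NUM_THREADS threads.
        per_session = omp_num_threads * inter_op + intra_op
    return per_session * sessions


cores = 8  # pod quota from the original post
plain = worst_case_threads(inter_op=8, intra_op=8)
mkl = worst_case_threads(inter_op=8, intra_op=8, omp_num_threads=8)  # assumed OMP value

print(f"normal TF: {plain} threads, {plain / cores:.1f}x the cores")
print(f"TF-MKL:    {mkl} threads, {mkl / cores:.1f}x the cores")
```

With these numbers the bound is 16 threads for normal TF and 72 for TF-MKL on 8 cores, which illustrates the oversubscription being described.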