Glow: [model zoo] Different results running ResNet50 models

Created on 9 Jan 2019  ·  3 Comments  ·  Source: pytorch/glow

The provided examples include two ResNet50 models. The first one (let's call it RN1) is the Caffe2 model from fb-glow-assets.s3.amazonaws.com/models/ and the second one (RN2) is the ONNX model from s3.amazonaws.com/download.onnx/models/opset_6/

I noticed that when running with a batch size larger than 1, RN2 returns a correct result only for the first image; the rest of the results are random values. The problem can be spotted in the low-level IR.

For RN1 in declare:

%gpu_0_data = WeightVar float<16 x 3 x 224 x 224> mutable // size: 9633792 // Users: @in 1
%save_gpu_0_softmax = WeightVar float<16 x 1000> mutable // size: 64000 // Users: @out 237 

For RN2 in declare:

%gpu_0_data_0 = WeightVar float<16 x 3 x 224 x 224> mutable // size: 9633792 // Users: @in 1
%save_gpu_0_softmax_1 = WeightVar float<1 x 1000> mutable // size: 4000 // Users: @out 243
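
The size comments make the mismatch easy to check: each size is the element count times 4 bytes per float element, which matches the numbers in the dump. A minimal Python sketch of that arithmetic (the helper name is only for illustration and is not part of Glow):

```python
from math import prod

FLOAT_BYTES = 4  # bytes per element, consistent with the sizes in the dump

def tensor_bytes(dims):
    """Size in bytes of a float tensor with the given dimensions."""
    return prod(dims) * FLOAT_BYTES

input_bytes = tensor_bytes([16, 3, 224, 224])  # same input shape for both models
rn1_out_bytes = tensor_bytes([16, 1000])       # RN1 output: one row per image
rn2_out_bytes = tensor_bytes([1, 1000])        # RN2 output: a single row

# RN2's output buffer holds 1/16 of what a batch of 16 needs.
print(input_bytes, rn1_out_bytes, rn2_out_bytes)
```

So the RN2 output buffer has room for one image's 1000 class scores, even though the input holds 16 images.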

And the results I get are the following:
RN1:
./bin/image-classifier tests/images/imagenet/16/*.png -use-imagenet-normalization -image-mode=0to1 -m=resnet50 -model-input-name=gpu_0/data -cpu -dump-ir

Model: resnet50
 File: tests/images/imagenet/16/cat_285_1.png   Label-K1: 281 (probability: 0.7190)
 File: tests/images/imagenet/16/cat_285_2.png   Label-K1: 281 (probability: 0.7190)
 File: tests/images/imagenet/16/cat_285_4.png   Label-K1: 281 (probability: 0.7190)
 File: tests/images/imagenet/16/cat_285_5.png   Label-K1: 281 (probability: 0.7190)
 File: tests/images/imagenet/16/cat_285.png     Label-K1: 281 (probability: 0.7190)
 File: tests/images/imagenet/16/dog_207_1.png   Label-K1: 207 (probability: 0.9446)
 File: tests/images/imagenet/16/dog_207_2.png   Label-K1: 207 (probability: 0.9446)
 File: tests/images/imagenet/16/dog_207_4.png   Label-K1: 207 (probability: 0.9446)
 File: tests/images/imagenet/16/dog_207_5.png   Label-K1: 207 (probability: 0.9446)
 File: tests/images/imagenet/16/dog_207.png     Label-K1: 207 (probability: 0.9446)
 File: tests/images/imagenet/16/zebra_340_1.png Label-K1: 340 (probability: 0.9984)
 File: tests/images/imagenet/16/zebra_340_2.png Label-K1: 340 (probability: 0.9984)
 File: tests/images/imagenet/16/zebra_340_3.png Label-K1: 340 (probability: 0.9984)
 File: tests/images/imagenet/16/zebra_340_4.png Label-K1: 340 (probability: 0.9984)
 File: tests/images/imagenet/16/zebra_340_5.png Label-K1: 340 (probability: 0.9984)
 File: tests/images/imagenet/16/zebra_340.png   Label-K1: 340 (probability: 0.9984)

RN2:
./bin/image-classifier tests/images/imagenet/16/*.png -use-imagenet-normalization -image-mode=0to1 -m=onnx_models/resnet50/model.onnx -model-input-name=gpu_0/data_0 -cpu -dump-ir

Model: onnx_models/resnet50/model.onnx
 File: tests/images/imagenet/16/cat_285_1.png   Label-K1: 281 (probability: 0.7190)
 File: tests/images/imagenet/16/cat_285_2.png   Label-K1: 803 (probability: 109501926753866426517948838075957248.0000)
 File: tests/images/imagenet/16/cat_285_4.png   Label-K1: 8 (probability: 1163137744525318537797146186350592.0000)
 File: tests/images/imagenet/16/cat_285_5.png   Label-K1: 163 (probability: 200735535189990277129236968606233788416.0000)
 File: tests/images/imagenet/16/cat_285.png     Label-K1: 140 (probability: 73989688674039732580357437915136.0000)
 File: tests/images/imagenet/16/dog_207_1.png   Label-K1: 492 (probability: 1198804886080946637596203847516160.0000)
 File: tests/images/imagenet/16/dog_207_2.png   Label-K1: 200 (probability: 76717526031150600707147634076286976.0000)
 File: tests/images/imagenet/16/dog_207_4.png   Label-K1: 444 (probability: 1163137744525318537797146186350592.0000)
 File: tests/images/imagenet/16/dog_207_5.png   Label-K1: 60 (probability: 76717526031150600707147634076286976.0000)
 File: tests/images/imagenet/16/dog_207.png     Label-K1: 308 (probability: 76717526031150600707147634076286976.0000)
 File: tests/images/imagenet/16/zebra_340_1.png Label-K1: 674 (probability: 1209009612681038303883533234470912.0000)
 File: tests/images/imagenet/16/zebra_340_2.png Label-K1: 212 (probability: 1163137744525318537797146186350592.0000)
 File: tests/images/imagenet/16/zebra_340_3.png Label-K1: 760 (probability: 1163137744525318537797146186350592.0000)
 File: tests/images/imagenet/16/zebra_340_4.png Label-K1: 679 (probability: 77457197409704310398483475464192.0000)
 File: tests/images/imagenet/16/zebra_340_5.png Label-K1: 890 (probability: 5948536641811781986122581567044845568.0000)
 File: tests/images/imagenet/16/zebra_340.png   Label-K1: 392 (probability: 1163137744525318537797146186350592.0000)
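
Since the last instruction is a softmax, every valid probability has to lie in [0, 1], so values like the ones above presumably come from memory the model never wrote. A small Python sketch of a sanity check on the reported top-1 probabilities (the function name is illustrative; the sample values are copied from the first three RN2 lines above):

```python
def invalid_probabilities(results):
    """Return the (file, probability) pairs whose probability is outside [0, 1]."""
    return [(name, p) for name, p in results if not 0.0 <= p <= 1.0]

rn2_results = [
    ("cat_285_1.png", 0.7190),
    ("cat_285_2.png", 109501926753866426517948838075957248.0),
    ("cat_285_4.png", 1163137744525318537797146186350592.0),
]

bad = invalid_probabilities(rn2_results)
bad_names = [name for name, _ in bad]
print(bad_names)
```

Only the first image passes the check, which matches the observation that just the first result is correct.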

Looking at the dumped IR, I can see that for some reason Glow adds an extra tensorview instruction at the very end of RN2, which effectively reduces the batch size from 16 to 1.

232 %gpu_0_pool5_12_res = allocactivation  { Ty: float<16 x 2048 x 1 x 1>} // size: 131072 // Users: @out 238, @in 235, @out 233
233 %gpu_0_pool5_12 = transpose @out %gpu_0_pool5_12_res, @in %gpu_0_pool5_11_res { Shuffle: [0, 3, 1, 2]}
234 %dealloc121 = deallocactivation @out %gpu_0_pool5_11_res // size: 131072
235 %tensorview_reshape = tensorview @in %gpu_0_pool5_12_res { Ty: float<1 x 2048>, Offsets: [0, 0, 0, 0]} // Users: @in 237
236 %copy_reshape_res = allocactivation  { Ty: float<1 x 2048>} // size: 8192 // Users: @out 241, @in 240, @out 237
237 %copy_reshape = copy @out %copy_reshape_res, @in %tensorview_reshape
238 %dealloc122 = deallocactivation @out %gpu_0_pool5_12_res // size: 131072
239 %gpu_0_pred_11_res = allocactivation  { Ty: float<1 x 1000>} // size: 4000 // Users: @in 243, @out 242, @out 244, @in 242, @out 240
240 %gpu_0_pred_11 = matmul @out %gpu_0_pred_11_res, @in %copy_reshape_res, @in %gpu_0_pred_w_01
241 %dealloc123 = deallocactivation @out %copy_reshape_res // size: 8192
242 %gpu_0_pred_12 = elementadd @out %gpu_0_pred_11_res, @in %gpu_0_pred_11_res, @in %gpu_0_pred_b_01
243 %gpu_0_softmax_1 = softmax @out %save_gpu_0_softmax_1, @in %gpu_0_pred_11_res
244 %dealloc124 = deallocactivation @out %gpu_0_pred_11_res // size: 4000
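
To illustrate why only the first image survives: a tensorview of type float<1 x 2048> at offset 0 over a float<16 x 2048 x 1 x 1> buffer covers just the first 2048 elements. Below is a rough Python model of that. Plain lists stand in for Glow buffers, a flat offset stands in for the per-dimension Offsets, and `tensorview` is a hypothetical helper, not Glow's implementation. It uses a small 4 x 3 batch instead of 16 x 2048:

```python
from math import prod

def tensorview(buffer, dims, offset=0):
    """Model a view: the prod(dims) elements starting at a flat offset."""
    return buffer[offset:offset + prod(dims)]

batch, features = 4, 3
# Sample i is filled with the value i, stored contiguously sample after sample.
pool_out = [float(i) for i in range(batch) for _ in range(features)]

view = tensorview(pool_out, [1, features])  # analogous to float<1 x 2048>

# Only sample 0 is visible through the view; samples 1..3 never reach the matmul.
print(view)
print(len(view) / len(pool_out))
```

Everything after the view (copy, matmul, elementadd, softmax) then operates on a single row, which is consistent with the float<1 x 1000> output in the declare section.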

All 3 comments

Hi @speryt, the reason this occurs is that the ONNX model is not designed to allow for dynamic batch sizes -- it has an explicit reshape built into the model which expects batch size == 1. If you run the ONNX ResNet50 with asserts turned on and a batch size > 1 then you will hit this assertion failure in createReshape():

https://github.com/pytorch/glow/blob/4cb8e893e2fdfca84e1706a140d29b3ef894d065/lib/Graph/Graph.cpp#L703-L704
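
As an illustration of what a check of this kind guards against, here is a Python sketch of the general rule that a reshape must preserve the number of elements. This is an assumption about the nature of the check, not a copy of the linked lines, and `check_reshape` is a made-up name:

```python
from math import prod

def check_reshape(in_dims, out_dims):
    """Raise if a reshape would change the total number of elements."""
    if prod(in_dims) != prod(out_dims):
        raise ValueError(
            f"reshape {in_dims} -> {out_dims} changes element count "
            f"({prod(in_dims)} != {prod(out_dims)})"
        )

check_reshape([1, 2048, 1, 1], [1, 2048])       # batch size 1: accepted

try:
    check_reshape([16, 2048, 1, 1], [1, 2048])  # batch size 16: rejected
    rejected = False
except ValueError as err:
    rejected = True
    print(err)
```

With a batch of 16 the input has 32768 elements while the hard-coded target shape has 2048, so a check like this fails as soon as it is enabled.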

You can see in our run.sh that for the ONNX models we run each inference on one image at a time.
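
For example, the RN2 invocation from the report could be split into one run per image. The Python sketch below only builds the command lines. The flags are taken from the RN2 command in the report; the helper name and the sample file list are illustrative, and this is not a copy of what run.sh does:

```python
def per_image_commands(images, model="onnx_models/resnet50/model.onnx"):
    """Build one image-classifier command per image (batch size 1 each)."""
    return [
        [
            "./bin/image-classifier", image,
            "-use-imagenet-normalization",
            "-image-mode=0to1",
            f"-m={model}",
            "-model-input-name=gpu_0/data_0",
            "-cpu",
        ]
        for image in images
    ]

images = [
    "tests/images/imagenet/16/cat_285_1.png",
    "tests/images/imagenet/16/dog_207_1.png",
]
commands = per_image_commands(images)
for cmd in commands:
    print(" ".join(cmd))  # each list could be passed to subprocess.run
```

Each command carries exactly one image, so the reshape that expects batch size == 1 sees the shape it was exported with.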

This definitely does represent a bug -- we shouldn't be able to run the model in release mode without throwing some kind of error. I'm not sure why the verifier is not catching this.

Now that https://github.com/pytorch/glow/pull/2248 has landed, you should see an error when you try to run models with batch sizes that are unsupported instead of seeing random results. Sorry for the confusion, @speryt!

Thanks @jfix71! It works now.
