I don't think you can take the weights from the model without batch norm and stick them into the model with batch normalization. The activation statistics will almost certainly be wrong at each batch-normalized layer if someone tries to fine-tune the model or otherwise continue training.
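To make that concrete, here is a rough, untested sketch (assuming the current torchvision `vgg16` / `vgg16_bn` layer layouts) of what copying the weights over would look like, and why the batch-norm statistics are still wrong afterwards. The key names don't line up because BN layers are interleaved in `features`, so conv layers are matched by order:

```python
import torch
import torchvision.models as models

plain = models.vgg16(pretrained=True)    # no batch norm
bn = models.vgg16_bn(pretrained=False)   # batch norm, randomly initialized

# Layer indices in `features` differ (BN layers are interleaved),
# so match the Conv2d modules by position instead of by state-dict key.
plain_convs = [m for m in plain.features if isinstance(m, torch.nn.Conv2d)]
bn_convs = [m for m in bn.features if isinstance(m, torch.nn.Conv2d)]
for src, dst in zip(plain_convs, bn_convs):
    dst.weight.data.copy_(src.weight.data)
    dst.bias.data.copy_(src.bias.data)

# The classifier (fully connected) layers line up one-to-one.
bn.classifier.load_state_dict(plain.classifier.state_dict())

# Problem: every BatchNorm2d layer still has its default statistics
# (running_mean=0, running_var=1, gamma=1, beta=0), which don't match the
# activation statistics the pretrained conv weights actually produce, so
# fine-tuning from this state is not equivalent to training with BN from scratch.
```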
Could we then also offer pretrained weights for the batch-normalized versions?
Depending on how you obtain them, I could also look into whether I can convert a caffemodel to a PyTorch state dict.
+1 for the request.