I am also having performance issues with the BatchNormalization layer.
I created a small script that reproduces my issue (output included) here: https://gist.github.com/ma1112/8c118d8584da9eb5637053193790bb47
In short, my (regression) network is [Conv, BatchNorm, Activation, MaxPool] x 6, Flatten, Dense, BatchNorm, Activation, Dense
Omitting the BatchNorm layers after the Conv layers results in 7x faster training [148 sec instead of 1102 sec per epoch] and 52x faster prediction [2.2 sec instead of 120 sec] on a Tesla K80.
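For reference, here is a minimal sketch of the architecture described above so others can try to reproduce the slowdown. The filter counts, kernel sizes, input shape, and optimizer are my own assumptions, not taken from the linked gist; toggle `use_bn` to compare timings with and without the BatchNorm layers after each Conv.

```python
from tensorflow.keras import layers, models

def build_model(input_shape=(64, 64, 1), use_bn=True):
    """[Conv, (BatchNorm), Activation, MaxPool] x 6, Flatten, Dense, BatchNorm, Activation, Dense."""
    model = models.Sequential()
    for i in range(6):
        if i == 0:
            model.add(layers.Conv2D(32, (3, 3), padding='same',
                                    input_shape=input_shape))
        else:
            model.add(layers.Conv2D(32, (3, 3), padding='same'))
        if use_bn:
            model.add(layers.BatchNormalization())
        model.add(layers.Activation('relu'))
        model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Flatten())
    model.add(layers.Dense(64))
    model.add(layers.BatchNormalization())
    model.add(layers.Activation('relu'))
    model.add(layers.Dense(1))  # single regression output
    model.compile(optimizer='adam', loss='mse')
    return model
```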
Me too. Finally someone is talking about it. When I add BN layers, performance drops by about 2x. Networks with BN layers are too slow; maybe you should do something to make it better.
Maybe this is the reason for the slowdown.
After reading through the comments from the last week, I finally figured out what the problem was with my code above.
It seems that TensorFlow really needs the data_format to be set to channels_last. In my example code above, I used the channels_first setting. After switching to channels_last [and modifying the shape of the input array and the axis of the BN layer], the training time decreased significantly [28 sec from 157 sec per epoch] even without the BN layers. Moreover, adding the BN layers now has minimal effect on the training time [37 sec instead of 28, with 6 added BN layers].
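To make the change concrete, here is a hedged sketch of the two configurations (illustrative values only, not the gist's actual code). With channels_first the channel dimension is axis 1, so BatchNormalization needs axis=1; with channels_last the channel dimension is last, which is BatchNormalization's default axis=-1, and the input array has to be reshaped to match.

```python
from tensorflow.keras import layers

# channels_first: inputs shaped (batch, channels, height, width)
conv_cf = layers.Conv2D(32, (3, 3), padding='same', data_format='channels_first')
bn_cf = layers.BatchNormalization(axis=1)   # channel axis is 1

# channels_last: inputs shaped (batch, height, width, channels)
conv_cl = layers.Conv2D(32, (3, 3), padding='same', data_format='channels_last')
bn_cl = layers.BatchNormalization(axis=-1)  # the default channel axis
```

In my case, only the channels_last variant avoided the slowdown reported above.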
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.