I might be doing this comparison wrong - please let me know! Most likely I have written a bug that causes the slowdown.
I implemented a variational autoencoder in MXNet using the Gluon library, and hybridized it. Here is the code: https://gist.github.com/altosaar/6c153e9ebd89a4b8ef6a638ed1520de4
I also implemented it in TensorFlow: https://github.com/altosaar/variational-autoencoder
However, it is orders of magnitude slower in MXNet than in TensorFlow. I made sure to match the hyperparameters and checked that I get the same results in both frameworks, so I don't think there is a bug in terms of the math (both implementations get to a training ELBO of around -100 on the binary MNIST dataset in a few thousand iterations).
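For context, here's roughly what a hybridized Gluon setup looks like — a minimal sketch of just the encoder half with illustrative layer sizes; the real model and hyperparameters are in the gist above:

```python
import mxnet as mx
from mxnet import gluon, nd

# Minimal sketch of the encoder half of a Gluon VAE, only to illustrate
# hybridization; layer sizes here are illustrative, not the gist's values.
class Encoder(gluon.HybridBlock):
    def __init__(self, n_hidden=256, n_latent=8, **kwargs):
        super(Encoder, self).__init__(**kwargs)
        with self.name_scope():
            self.hidden = gluon.nn.Dense(n_hidden, activation='relu')
            self.mu = gluon.nn.Dense(n_latent)
            self.log_var = gluon.nn.Dense(n_latent)

    def hybrid_forward(self, F, x):
        h = self.hidden(x)
        return self.mu(h), self.log_var(h)

encoder = Encoder()
encoder.initialize(ctx=mx.cpu())
encoder.hybridize()  # compile the imperative graph into a symbolic one
mu, log_var = encoder(nd.random.normal(shape=(64, 784)))
```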
I'm using MXNet 1.1.0 with a P100 GPU, and TensorFlow 1.6.0.
Here is the timing information:
MXNet:
With GPU:
$ python variational_autoencoder_gluon.py
Iter 1000 ELBO: -144.5 speed: 3.567e-03 s/iter
Iter 2000 ELBO: -118.3 speed: 3.686e-03 s/iter
Without GPU:
$ python variational_autoencoder_gluon.py
Iter 1000 ELBO: -143.0 speed: 1.171e-02 s/iter
Iter 2000 ELBO: -121.2 speed: 1.192e-02 s/iter
TensorFlow:
On CPU:
$ python vae.py
Iteration: 1000 ELBO: -137.405 sec/iter: 1.878e-06
Iteration: 2000 ELBO: -125.329 sec/iter: 1.929e-06
Hello @altosaar, thanks for your benchmark. Could you please add your compile configuration?
@eric-haibin-lin
@altosaar Did you have a chance to run the code with mxnet profiler and see which operator is the bottleneck? https://github.com/apache/incubator-mxnet/blob/master/docs/faq/perf.md#profiler
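For reference, a minimal sketch of turning the profiler on around a short window of training (this uses the `mx.profiler` API from MXNet 1.2+; the 1.1.0 interface differs slightly, so treat the exact calls as an assumption):

```python
import mxnet as mx

# Profile a short window of training and dump a per-operator summary.
mx.profiler.set_config(profile_all=True, aggregate_stats=True,
                       filename='vae_profile.json')
mx.profiler.set_state('run')

# ... run a few hundred training iterations here ...

mx.nd.waitall()               # wait for all asynchronous work to finish
mx.profiler.set_state('stop')
print(mx.profiler.dumps())    # aggregate stats; the JSON trace goes to the file
```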
@altosaar you can try the CPU with the MKL-DNN backend using the latest master branch.
I think it will be much faster.
https://github.com/apache/incubator-mxnet/blob/master/docs/faq/perf.md
For training and inference on Intel Xeon CPUs, we suggest enabling USE_MKLDNN = 1 in config.mk.
We also find that setting the following environment variables can help:
export KMP_AFFINITY=granularity=fine,compact,1,0 if there are two physical CPUs
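If you want to confirm that your binary was actually built with MKL-DNN, recent MXNet versions (roughly 1.5+) expose a runtime feature list; a hedged sketch, assuming the `mxnet.runtime` module is available in your build:

```python
import mxnet as mx
from mxnet.runtime import Features  # available in recent MXNet releases

# Print whether this MXNet build includes the MKL-DNN backend.
features = Features()
print('MXNet version:', mx.__version__)
print('MKLDNN enabled:', features.is_enabled('MKLDNN'))
```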
MNIST is too small to bench. IO is the main bottleneck.
You have an error in your TensorFlow code @altosaar
You are setting `t0 = time.time()` right before computing `(time.time() - t0)`.
Hence the through-the-roof numbers for TensorFlow (0.5M iter/sec on CPU should have startled you 😄)
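For concreteness, here is a sketch of the buggy pattern versus the fix, with a dummy training step (the names are illustrative, not the actual vae.py code):

```python
import time

def train_one_step():
    time.sleep(0.001)  # stand-in for a real training step

n_steps, print_every = 3000, 1000

# Buggy pattern: t0 is reset immediately before it is read, so the reported
# interval measures almost nothing (hence ~2e-6 s/iter).
for step in range(1, n_steps + 1):
    train_one_step()
    if step % print_every == 0:
        t0 = time.time()
        print('sec/iter:', (time.time() - t0) / print_every)

# Fixed pattern: start the clock before the work and reset it after reporting.
t0 = time.time()
for step in range(1, n_steps + 1):
    train_one_step()
    if step % print_every == 0:
        print('sec/iter:', (time.time() - t0) / print_every)
        t0 = time.time()
```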
After fixing that, using your benchmark and rewriting the metrics, MXNet is twice as fast 🎉:
MXNet:
Iter 11000 ELBO: -102.9 Examples/s: 24981.99
Iter 12000 ELBO: -104.8 Examples/s: 26717.71
Tensorflow:
Iteration: 10000 ELBO: -96.456 Examples/s: 10878.597
Iteration: 11000 ELBO: -103.466 Examples/s: 10898.741
As additional advice, always use a speed metric that is easy to comprehend; examples/sec is a good one, sec/iter not so much. Otherwise you would have noticed sooner that 1.929e-06 sec/iter (33M images/sec) was the abnormal one 😃
Dang, I knew it was a silly bug on my end, thanks for catching that @ThomasDelteil :) I just pushed the fix. You're right, I should have caught it by realizing that millions of iterations per second is very unreasonable.
Here are the new timings I get:
TensorFlow 1.7.0 CPU:
Iteration: 1000 ELBO: -131.288 s/iter: 5.380e-03
Iteration: 2000 ELBO: -122.167 s/iter: 5.253e-03
TensorFlow 1.7.0 GPU:
Iteration: 1000 ELBO: -142.142 s/iter: 3.681e-03
Iteration: 2000 ELBO: -114.007 s/iter: 3.725e-03
This matches the MXNet timings on GPU 👍 :) and it's awesome that MXNet is a lot faster on CPU!
P.S. I agree examples/s is good in some cases. For generative models, I find time/iteration more informative (the convergence of the objective should be measured in number of parameter updates, not epochs, so this is what I focus on).