I might be doing this comparison wrong - please let me know! Most likely I have written a bug that causes the slowdown.
I implemented a variational autoencoder in MXNet using the Gluon library, and hybridized it. Here is the code: https://gist.github.com/altosaar/6c153e9ebd89a4b8ef6a638ed1520de4
I also implemented it in TensorFlow: https://github.com/altosaar/variational-autoencoder
However, it is orders of magnitude slower in MXNet than in TensorFlow. I made sure to match the hyperparameters and checked that I get the same results in both frameworks, so I don't think there is a bug in terms of the math (both implementations get to a training ELBO of around -100 on the binary MNIST dataset in a few thousand iterations).
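For context, here's roughly what a hybridized Gluon setup looks like — a minimal sketch of just the encoder half with illustrative layer sizes; the real model and hyperparameters are in the gist above:

```python
import mxnet as mx
from mxnet import gluon, nd

# Minimal sketch of the encoder half of a Gluon VAE, only to illustrate
# hybridization; layer sizes here are illustrative, not the gist's values.
class Encoder(gluon.HybridBlock):
    def __init__(self, n_hidden=256, n_latent=8, **kwargs):
        super(Encoder, self).__init__(**kwargs)
        with self.name_scope():
            self.hidden = gluon.nn.Dense(n_hidden, activation='relu')
            self.mu = gluon.nn.Dense(n_latent)
            self.log_var = gluon.nn.Dense(n_latent)

    def hybrid_forward(self, F, x):
        h = self.hidden(x)
        return self.mu(h), self.log_var(h)

encoder = Encoder()
encoder.initialize(ctx=mx.cpu())
encoder.hybridize()  # compile the imperative graph into a symbolic one
mu, log_var = encoder(nd.random.normal(shape=(64, 784)))
```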
I'm using MXNet 1.1.0 with a P100 GPU, and TensorFlow 1.6.0.
Here is the timing information:
MXNet:
With GPU:
$ python variational_autoencoder_gluon.py
Iter 1000 ELBO: -144.5 speed: 3.567e-03 s/iter
Iter 2000 ELBO: -118.3 speed: 3.686e-03 s/iter
Without GPU:
$ python variational_autoencoder_gluon.py
Iter 1000 ELBO: -143.0 speed: 1.171e-02 s/iter
Iter 2000 ELBO: -121.2 speed: 1.192e-02 s/iter
TensorFlow:
On CPU:
$ python vae.py
Iteration: 1000 ELBO: -137.405 sec/iter: 1.878e-06
Iteration: 2000 ELBO: -125.329 sec/iter: 1.929e-06
Hello @altosaar, thanks for your benchmark. Could you please add your compile configuration?
@eric-haibin-lin
@altosaar Did you have a chance to run the code with mxnet profiler and see which operator is the bottleneck? https://github.com/apache/incubator-mxnet/blob/master/docs/faq/perf.md#profiler
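For reference, a minimal sketch of turning the profiler on around a short window of training (this uses the `mx.profiler` API from MXNet 1.2+; the 1.1.0 interface differs slightly, so treat the exact calls as an assumption):

```python
import mxnet as mx

# Profile a short window of training and dump a per-operator summary.
mx.profiler.set_config(profile_all=True, aggregate_stats=True,
                       filename='vae_profile.json')
mx.profiler.set_state('run')

# ... run a few hundred training iterations here ...

mx.nd.waitall()               # wait for all asynchronous work to finish
mx.profiler.set_state('stop')
print(mx.profiler.dumps())    # aggregate stats; the JSON trace goes to the file
```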
@altosaar you can try the CPU with the MKL-DNN backend using the latest master branch.
I think it will be much faster.
https://github.com/apache/incubator-mxnet/blob/master/docs/faq/perf.md
For training and inference on Intel Xeon CPUs, we suggest enabling USE_MKLDNN = 1 in config.mk.
We also find that setting the following environment variables can help:
export KMP_AFFINITY=granularity=fine,compact,1,0 if there are two physical CPUs
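If you want to confirm that your binary was actually built with MKL-DNN, recent MXNet versions (roughly 1.5+) expose a runtime feature list; a hedged sketch, assuming the `mxnet.runtime` module is available in your build:

```python
import mxnet as mx
from mxnet.runtime import Features  # available in recent MXNet releases

# Print whether this MXNet build includes the MKL-DNN backend.
features = Features()
print('MXNet version:', mx.__version__)
print('MKLDNN enabled:', features.is_enabled('MKLDNN'))
```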
MNIST is too small to bench. IO is the main bottleneck.
You have an error in your TensorFlow code @altosaar
You are setting `t0 = time.time()` right before computing `(time.time() - t0)`.
Hence the through-the-roof numbers for TensorFlow (0.5M iter/sec on CPU should have startled you 😄)
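For concreteness, here is a sketch of the buggy pattern versus the fix, with a dummy training step (the names are illustrative, not the actual vae.py code):

```python
import time

def train_one_step():
    time.sleep(0.001)  # stand-in for a real training step

n_steps, print_every = 3000, 1000

# Buggy pattern: t0 is reset immediately before it is read, so the reported
# interval measures almost nothing (hence ~2e-6 s/iter).
for step in range(1, n_steps + 1):
    train_one_step()
    if step % print_every == 0:
        t0 = time.time()
        print('sec/iter:', (time.time() - t0) / print_every)

# Fixed pattern: start the clock before the work and reset it after reporting.
t0 = time.time()
for step in range(1, n_steps + 1):
    train_one_step()
    if step % print_every == 0:
        print('sec/iter:', (time.time() - t0) / print_every)
        t0 = time.time()
```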
After fixing that, using your benchmark and rewriting the metrics, MXNet is twice as fast 🎉:
MXNet:
Iter 11000 ELBO: -102.9 Examples/s: 24981.99
Iter 12000 ELBO: -104.8 Examples/s: 26717.71
Tensorflow:
Iteration: 10000 ELBO: -96.456 Examples/s: 10878.597
Iteration: 11000 ELBO: -103.466 Examples/s: 10898.741
As additional advice, always use a speed metric that is easy to comprehend; examples/sec is a good one, sec/iter not so much. Otherwise you would have noticed sooner that 1.929e-06 sec/iter (33M images/sec) was the abnormal one 😃
Dang, I knew it was a silly bug on my end, thanks for catching that @ThomasDelteil :) I just pushed the fix. You're right, I should have caught it by realizing that millions of iterations per second is very unreasonable.
Here are the new timings I get:
TensorFlow 1.7.0 CPU:
Iteration: 1000 ELBO: -131.288 s/iter: 5.380e-03
Iteration: 2000 ELBO: -122.167 s/iter: 5.253e-03
TensorFlow 1.7.0 GPU:
Iteration: 1000 ELBO: -142.142 s/iter: 3.681e-03
Iteration: 2000 ELBO: -114.007 s/iter: 3.725e-03
This matches the MXNet timings on GPU 👍 :) and it's awesome that MXNet is a lot faster on CPU!
P.S. I agree examples/s is good in some cases. For generative models, I find time/iteration more informative (the convergence of the objective should be measured in number of parameter updates, not epochs, so this is what I focus on).