Incubator-mxnet: MXNet much slower than TensorFlow

Created on 11 Apr 2018  ·  7 comments  ·  Source: apache/incubator-mxnet

I might be doing this comparison wrong - please let me know! Most likely I have written a bug that causes the slowdown.

I implemented a variational autoencoder in MXNet using the Gluon library, and hybridized it. Here is the code: https://gist.github.com/altosaar/6c153e9ebd89a4b8ef6a638ed1520de4

I also implemented it in TensorFlow: https://github.com/altosaar/variational-autoencoder

However, it is orders of magnitude slower in MXNet than in TensorFlow. I made sure to match the hyperparameters and checked that I get the same results in both frameworks, so I don't think there is a bug in terms of the math (both implementations get to a training ELBO of around -100 on the binary MNIST dataset in a few thousand iterations).

I'm using MXNet 1.1.0 with a P100 GPU, and TensorFlow 1.6.0.

Here is the timing information:

MXNet:

With GPU:

$ python variational_autoencoder_gluon.py
Iter 1000       ELBO: -144.5    speed: 3.567e-03 s/iter
Iter 2000       ELBO: -118.3    speed: 3.686e-03 s/iter

Without GPU:

$ python variational_autoencoder_gluon.py
Iter 1000       ELBO: -143.0    speed: 1.171e-02 s/iter
Iter 2000       ELBO: -121.2    speed: 1.192e-02 s/iter

TensorFlow:

On CPU:

$ python vae.py
Iteration: 1000 ELBO: -137.405 sec/iter: 1.878e-06
Iteration: 2000 ELBO: -125.329 sec/iter: 1.929e-06
Performance

Most helpful comment

You have an error in your TensorFlow code @altosaar:
you are setting t0 = time.time() right before computing (time.time() - t0).
Hence the through-the-roof numbers for TensorFlow (0.5M iter/sec on CPU should have startled you 😄).
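The bug pattern looks roughly like this; a minimal hypothetical reproduction with a dummy workload, not the exact code from the gist:

```python
import time

def timed_loop(n_iters, buggy):
    """Return measured seconds/iter for a dummy workload."""
    t0 = time.time()
    for _ in range(n_iters):
        _ = sum(range(10_000))   # stand-in for one training step
        if buggy:
            t0 = time.time()     # BUG: resets the start right before reading it
    return (time.time() - t0) / n_iters
```

The buggy version reports a near-zero time no matter how slow the loop actually is, because the interval is measured from the last reset rather than from the start of the timed region.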

After fixing that, using your benchmark and rewriting the metrics, MXNet is twice as fast 🎉:

MXNet:

Iter 11000  ELBO: -102.9     Examples/s: 24981.99
Iter 12000  ELBO: -104.8     Examples/s: 26717.71

Tensorflow:

Iteration: 10000 ELBO: -96.456 Examples/s: 10878.597
Iteration: 11000 ELBO: -103.466 Examples/s: 10898.741

As additional advice, always use speed metrics that are easy to comprehend; examples/sec is a good one, sec/iter not so much. Otherwise you would have noticed sooner that 1.929e-06 sec/iter (33M images/sec) was the abnormal one 😄
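Converting between the two metrics is a one-liner. Assuming a batch size of 64 (hypothetical, but consistent with the 33M images/sec figure quoted above):

```python
batch_size = 64  # assumption: implied by the 33M images/s figure above

def examples_per_sec(sec_per_iter, batch_size=batch_size):
    """Convert seconds-per-iteration into the examples/sec metric."""
    return batch_size / sec_per_iter

# Sanity check on the buggy TensorFlow number: 1.929e-06 s/iter implies
# roughly 33 million images/s, which is clearly abnormal for a CPU.
```

On this scale a realistic GPU number (a few milliseconds per iteration) lands in the tens of thousands of examples/sec, so an outlier in the millions is obvious at a glance.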

All 7 comments

Hello @altosaar, thanks for your benchmark. Could you please add your compile configuration?

@eric-haibin-lin

@altosaar Did you have a chance to run the code with mxnet profiler and see which operator is the bottleneck? https://github.com/apache/incubator-mxnet/blob/master/docs/faq/perf.md#profiler
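For reference, the profiler can be enabled without modifying the script via environment variables; a sketch following the route described in the perf FAQ (variable names and behavior may differ across MXNet versions):

```shell
# Start the profiler automatically at process launch and record all operators.
export MXNET_PROFILER_AUTOSTART=1
export MXNET_PROFILER_MODE=1
python variational_autoencoder_gluon.py
# The resulting profile.json can be opened in chrome://tracing
# to inspect per-operator timings.
```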

@altosaar you can try the CPU with the MKL-DNN backend on the latest master branch.
I think it will be much faster.

https://github.com/apache/incubator-mxnet/blob/master/docs/faq/perf.md

When using Intel Xeon CPUs for training and inference, we suggest enabling USE_MKLDNN = 1 in config.mk.
We also find that setting the following two environment variables can help:
export KMP_AFFINITY=granularity=fine,compact,1,0 if there are two physical CPUs
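Put together, a sketch of the CPU setup; the OMP_NUM_THREADS value is an assumption and should be tuned to the machine's physical core count:

```shell
# Build-time: set USE_MKLDNN = 1 in config.mk before compiling MXNet.
# Run-time: pin OpenMP threads on a dual-socket Xeon.
export KMP_AFFINITY=granularity=fine,compact,1,0
export OMP_NUM_THREADS=16   # assumption: set to the number of physical cores
python variational_autoencoder_gluon.py
```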

MNIST is too small to benchmark; IO is the main bottleneck.


Dang, I knew it was a silly bug on my end. Thanks for catching that, @ThomasDelteil :) I just pushed the fix. You're right, I should have caught it by realizing that millions of iterations/s is very unreasonable.

Here are the new timings I get:

TensorFlow 1.7.0 CPU:

Iteration: 1000 ELBO: -131.288 s/iter: 5.380e-03
Iteration: 2000 ELBO: -122.167 s/iter: 5.253e-03

TensorFlow 1.7.0 GPU:

Iteration: 1000 ELBO: -142.142 s/iter: 3.681e-03
Iteration: 2000 ELBO: -114.007 s/iter: 3.725e-03

This matches the MXNet timings on GPU 👍 :) and it's awesome that MXNet is a lot faster on CPU!

P.S. I agree examples/s is good in some cases. For generative models, I find time per iteration more informative: the convergence of the objective should be measured in number of parameter updates, not epochs, so that is what I focus on.
