CNTK speed: BrainScript vs Python

Created on 30 Dec 2016 · 9 Comments · Source: microsoft/CNTK

Hi,

I tested the training speed of two examples from the standard CNTK distribution:

  1. BrainScript: cntk/Examples/Image/GettingStarted/01_OneHidden.cntk
  2. Python: cntk/Examples/Image/Classification/MLP/Python/SimpleMNIST.py

Both examples use the same model. I trained on a GPU in both cases. Here are the results:

  1. BrainScript: ~95,000 samples per second
  2. Python: ~61,000 samples per second

Is there something wrong?

All 9 comments

When answering this question, can someone please explain to me the advantages and disadvantages of BrainScript vs. Python?
Do we need to keep track of both?

Hi @arijit17,

Python is currently a little slower than BrainScript because of the overhead of converting Python objects to CNTK's C++ types.

I wondered yesterday how large this overhead could be, so here is a graph of the operations involved in a prediction (evaluate()) and a minibatch training step (train_minibatch()):

[image: graph of the calls made by evaluate() and train_minibatch()]

You can see that each call to either method results in subsequent calls:

  • wrapper()
  • sanitize_batch()
  • etc.

On the BrainScript side, by contrast, CNTK performs all operations directly on C++ objects, without the overhead of mapping Python types to C++.

Edit:

  • Python is a very flexible way to use CNTK within an IDE, with all of Python's features.
  • BrainScript is a little more restricted (data formats, preprocessing, no IDE support, etc.) but is currently the fastest way to use CNTK.

Hope it helps :)
Morgan

Hi Morgan: thanks for the analysis!

SimpleMNIST.py uses a small minibatch size, which makes the Python overhead in the training loop stand out. If you increase the minibatch size to 1024 (depending on your GPU memory), you'll notice a smaller gap.
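The effect of minibatch size described above can be sketched with a simple cost model, assuming each training call pays a fixed Python-side overhead plus a per-sample compute cost. The function name and all numbers below are hypothetical, chosen only for illustration; they are not measurements from CNTK.

```python
def samples_per_second(minibatch_size, per_call_overhead_s, per_sample_compute_s):
    """Throughput under a fixed-overhead-per-call model (illustrative only)."""
    time_per_call = per_call_overhead_s + minibatch_size * per_sample_compute_s
    return minibatch_size / time_per_call


# Hypothetical costs: 0.5 ms of Python overhead per call, 10 us of compute per sample.
OVERHEAD_S = 0.5e-3
COMPUTE_S = 10e-6

for size in (64, 256, 1024):
    rate = samples_per_second(size, OVERHEAD_S, COMPUTE_S)
    ceiling = 1.0 / COMPUTE_S  # throughput if the per-call overhead were zero
    print(f"minibatch {size:5d}: {rate:9.0f} samples/s ({rate / ceiling:.0%} of ceiling)")
```

Under this model the fixed overhead is spread over more samples as the minibatch grows, so throughput approaches the overhead-free ceiling without ever reaching it.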

Is it possible to create MinibatchData/Value/NDArrayView objects only once and then move data into them from NumPy arrays?

Yes, you can use Value.create to create the Value object on the GPU. However, the copy from the NumPy array to the GPU is done synchronously here, so if you do it per minibatch you will still have CPU/GPU stalls. CNTK readers copy the data to the GPU asynchronously while computation is going on, which eliminates the stall.
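The idea of overlapping data preparation with computation can be sketched in plain Python with a background thread and a bounded queue. This is only an illustration of the general prefetching pattern, not CNTK's reader implementation; `prefetching` and `load_batches` are hypothetical names, and error handling is omitted. In Python, real overlap also depends on the loading work releasing the GIL (I/O or native calls).

```python
import queue
import threading


def prefetching(batches, depth=2):
    """Yield items from `batches`, preparing up to `depth` ahead on a background thread."""
    buffer = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for batch in batches:
            buffer.put(batch)  # blocks while the buffer is full
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is done:
            return
        yield item


def load_batches(n):
    # Stand-in for reading a minibatch and copying it to the device.
    for i in range(n):
        yield [i] * 4


for batch in prefetching(load_batches(3)):
    # Stand-in for the training step; the next batch is prepared meanwhile.
    print(sum(batch))
```

The bounded queue keeps the producer from running arbitrarily far ahead, which limits how much memory the prepared batches can occupy.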

Indeed, caching MinibatchData/Value/NDArrayView objects inside the custom UserMinibatchSource speeds up training.
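A minimal sketch of the caching idea, assuming the whole dataset fits in memory: each minibatch is converted once and the converted object is reused on later epochs. `CachingSource` and `convert` are hypothetical stand-ins written in plain Python; they do not use or reproduce the UserMinibatchSource API.

```python
class CachingSource:
    """Convert each minibatch once and reuse the converted object afterwards."""

    def __init__(self, raw_batches, convert):
        self._raw = list(raw_batches)
        self._convert = convert  # hypothetical expensive conversion step
        self._cache = {}

    def __len__(self):
        return len(self._raw)

    def get(self, index):
        # Convert on first access only; later epochs hit the cache.
        if index not in self._cache:
            self._cache[index] = self._convert(self._raw[index])
        return self._cache[index]


conversions = []


def convert(raw):
    # Stand-in for an expensive conversion; records each call so it can be counted.
    conversions.append(raw)
    return tuple(raw)


source = CachingSource([[1, 2], [3, 4]], convert)
for epoch in range(3):
    for i in range(len(source)):
        source.get(i)

print(len(conversions))  # number of conversions performed across three epochs
```

The trade-off is memory: the cache holds every converted minibatch for the lifetime of the source.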

When you said "asynchronously", did you mean the ReaderShim class with DataTransferer? It would be great to extend the SwigMinibatchSource class with such features to further speed up UserMinibatchSource.

Yes, DataTransferer does the async copy from CPU to GPU. We are working on exposing this to users; stay tuned.

Asynchronous CPU/GPU transfers are also missing from the C# bindings, unless you use CNTK.MinibatchSource.TextFormatMinibatchSource.

The method suggested in #2516 (calling Value.CreateBatch) leads to very low GPU utilisation (even when done on a separate thread).

Ideally there would be a C# version of MinibatchSourceFromData (like in the Python bindings) that uses asynchronous CPU/GPU transfers.
