Hi all:
I was playing with Block Momentum in Python.
I see (here) that the only value to play with is the block_size.
What’s the typical rule of thumb for selecting the block_size? Let's say my full dataset (and epoch size) is 60000 samples and the number of workers is 4.
Also, I saw in your presentations that Block Momentum “Combines model averaging with error-residual idea”. Does that mean one can use it together with 1-bit SGD? But here I see that 1-bit SGD is not allowed.
So, what do you mean by error residual here?
Block size is dataset dependent, so it's difficult to give general recommendations. For Block Momentum to work well, make sure you use warm-up at the beginning.
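To get a feel for the trade-off with the numbers from the question (60000 samples per epoch, 4 workers), here is a small back-of-the-envelope sketch. It assumes block_size counts samples across all workers between two synchronizations, and that warm-up means a number of samples trained before block-level synchronization starts. `block_plan` and `warmup_samples` are made-up names for illustration, not CNTK API.

```python
def block_plan(epoch_size, num_workers, block_size, warmup_samples=0):
    """Rough per-epoch bookkeeping for a given block size (illustrative only)."""
    # Samples left in the first epoch once the assumed warm-up phase is over.
    distributed_samples = max(epoch_size - warmup_samples, 0)
    return {
        # Each worker trains locally on its share of the block.
        "samples_per_worker_per_block": block_size // num_workers,
        # Number of complete blocks, i.e. synchronizations, in the first epoch.
        "syncs_in_first_epoch": distributed_samples // block_size,
    }


for block_size in (1024, 4096, 16384):
    print(block_size, block_plan(60000, 4, block_size))
```

A larger block means fewer synchronizations but more local drift per worker between them, which is one reason a single recommended value is hard to give.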
In theory, Block Momentum can be combined with 1-bit SGD, but that combination is not currently supported in CNTK. Block Momentum reduces the communication frequency, so it is most likely sufficient by itself for distributed training.
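For intuition on how the communication gets reduced, here is a minimal sketch of one block-level update as I understand it from the blockwise model-update filtering description: each worker trains locally on its part of a block, the models are averaged, and the averaged change is filtered with a block-level momentum before being applied to the global model. This is not CNTK's implementation. The function name, the plain-list representation, and the parameter names are assumptions for illustration, and the Nesterov variant is left out.

```python
def block_momentum_step(w_global, delta_prev, worker_models,
                        block_momentum, block_lr=1.0):
    """Apply one block-level update (illustrative sketch).

    w_global:      global model before the block, as a list of floats
    delta_prev:    filtered update applied after the previous block
    worker_models: one locally trained model per worker, each started from w_global
    """
    n = len(worker_models)
    # Plain model averaging across workers: the only communication per block.
    w_avg = [sum(ws) / n for ws in zip(*worker_models)]
    # Aggregated model change produced by this block of data.
    g = [a - w for a, w in zip(w_avg, w_global)]
    # Filter the change with block-level momentum, carrying over past updates.
    delta = [block_momentum * d + block_lr * gi for d, gi in zip(delta_prev, g)]
    w_new = [w + d for w, d in zip(w_global, delta)]
    return w_new, delta


w, delta = block_momentum_step([0.0, 0.0], [0.0, 0.0],
                               [[1.0, 2.0], [3.0, 4.0]], block_momentum=0.5)
print(w, delta)
```

With block_momentum set to 0 and block_lr set to 1 this reduces to plain model averaging.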
Thanks! Block momentum is cool.
Some more questions:
In your CVPR Tutorial you mentioned automatic mini-batch sizing. Does it happen automatically behind the scenes? Or can we, as users, turn it ON or OFF?
Here in section 5.2, I see that automatic mini-batch sizing can be turned ON or OFF with the BrainScript. Is it also possible to do that with Python API?
I see that block-momentum gives the maximum benefit. 1-bit SGD is a cool idea, but do you know of any simple example where I can easily see a benefit with 1-bit SGD?
Please close this Open Question after answering.
I feel I have a good understanding now (apart from the 2 questions above).
No, automatic mini-batch sizing doesn't happen behind the scenes. Currently it's only available in BrainScript, and we have yet to port it to Python. However, in Python you can do manual minibatch sizing, similar to the recent FB work, using a minibatch size scheduler.
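As a rough picture of what manual minibatch sizing amounts to, here is a plain-Python sketch of a piecewise schedule that maps the number of samples seen so far to a minibatch size. `minibatch_size_at` and the schedule values are hypothetical and only illustrate the idea. This is not the CNTK scheduler itself, and the sizes are not a recommendation.

```python
def minibatch_size_at(samples_seen, schedule, epoch_size):
    """Look up the minibatch size for the current position in training.

    schedule: list of (num_epochs, minibatch_size) pairs applied in order;
              the last size stays in effect once the schedule is exhausted.
    """
    epoch = samples_seen // epoch_size
    for num_epochs, size in schedule:
        if epoch < num_epochs:
            return size
        epoch -= num_epochs
    return schedule[-1][1]


# Hypothetical schedule: grow the minibatch as training progresses.
schedule = [(2, 64), (2, 128), (1, 256)]
for samples_seen in (0, 120000, 240000):
    print(samples_seen, minibatch_size_at(samples_seen, schedule, epoch_size=60000))
```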
The benefit of 1-bit SGD is its simplicity: there are no parameters to tune. For Block Momentum there are still parameters to tune. I think in general Block Momentum is a superior algorithm to 1-bit SGD.
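On the earlier "error residual" question, the idea in 1-bit SGD can be sketched like this: each gradient value is sent as a single bit, and whatever the quantization loses is kept locally and added back into the next gradient, so the error is delayed rather than dropped. This is a simplified illustration, not CNTK code. Decoding each bit to the mean of the positive or negative values is one common choice that I am assuming here, and the function name is made up.

```python
def one_bit_quantize(gradient, residual):
    """Quantize a gradient to one bit per value with error feedback (sketch)."""
    # Add back the quantization error left over from the previous step.
    corrected = [g + r for g, r in zip(gradient, residual)]
    bits = [c >= 0 for c in corrected]
    pos = [c for c in corrected if c >= 0]
    neg = [c for c in corrected if c < 0]
    # Two reconstruction values, one per bit state (an assumed decoding rule).
    pos_mean = sum(pos) / len(pos) if pos else 0.0
    neg_mean = sum(neg) / len(neg) if neg else 0.0
    decoded = [pos_mean if b else neg_mean for b in bits]
    # What the receiver did not get is carried over to the next minibatch.
    new_residual = [c - d for c, d in zip(corrected, decoded)]
    return bits, decoded, new_residual


bits, decoded, residual = one_bit_quantize([0.5, -0.25, 1.5, -0.75], [0.0] * 4)
print(bits, decoded, residual)
```

Note that nothing in the sketch needs tuning, which matches the point about simplicity.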