Hi, I am trying to understand distributed training performance on GPU clusters, and I am a little confused about the following two distributed parallel modes:
right?
What do sync and async mean here? Why are the modes called sync and async, and what is the motivation for each?
Could someone explain?
Thanks.
I observe the same behavior and would like to know the intuition behind this implementation. Also, what is the work item for a worker in sync mode? Should I allocate GPUs to the workers?
According to my understanding of the code, the program will call the model function on the "datashard_device".
In single-machine training (and async mode), the datashard_device will be the devices assigned to the worker process.
However, in sync mode, the datashard_device will be those assigned to the ps process. So the training is actually running in the ps process.
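The device selection described above can be sketched roughly as follows. This is only an illustration of the behavior as I read it, not the actual T2T code: the function name `datashard_devices` and its parameters are hypothetical, and the device strings are just in the usual TensorFlow style.

```python
def datashard_devices(sync, worker_task, gpus_per_task, num_ps_tasks):
    """Return the devices the model function would be called on.

    Hypothetical helper, written from the description in this thread:
    - async (or single-machine): the GPUs of the calling worker process
    - sync: the GPUs of every ps process
    """
    if sync:
        # Sync mode: data shards are placed on the GPUs of all ps tasks.
        return [
            "/job:ps/task:%d/device:GPU:%d" % (task, gpu)
            for task in range(num_ps_tasks)
            for gpu in range(gpus_per_task)
        ]
    # Async mode: each worker only uses its own GPUs.
    return [
        "/job:worker/task:%d/device:GPU:%d" % (worker_task, gpu)
        for gpu in range(gpus_per_task)
    ]


sync_devices = datashard_devices(
    sync=True, worker_task=0, gpus_per_task=2, num_ps_tasks=2)
async_devices = datashard_devices(
    sync=False, worker_task=1, gpus_per_task=2, num_ps_tasks=2)
```

With these assumed inputs, the sync case lists four ps GPUs while the async case lists only the two GPUs of worker task 1.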
This is somewhat counterintuitive, since the common practice is to use the ps only for synchronizing model parameters. I wonder why this design was chosen and what the benefit is.
Thanks.
The naming makes things a bit confusing. Let me try to clarify things.
In sync mode, the worker device is actually quite dumb and isn't doing much other than calling Session.run, which launches the actual work to be done on the parameter servers (that's where the confusing naming is - the parameter servers in this case are the "workers" and the worker is actually just a master). Because all work is happening synchronously across the PS workers, noise is reduced - it's equivalent to multiplying your batch size by the number of PS workers you have.
In async mode, the workers are each running independently and running training on themselves. The parameter servers are being used only for parameters. This is noisier because there is no synchronization between the workers.
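To make the noise argument concrete, here is a toy simulation in plain Python. It is a sketch under strong assumptions: each shard's gradient is modeled as the true gradient plus independent Gaussian noise, and all names (`noisy_gradient`, `sync_update`, `async_update`) are made up for illustration. It only captures the averaging effect of a synchronous step; it does not model stale parameters, which are another source of noise in async training.

```python
import random
import statistics


def noisy_gradient(rng, true_grad=1.0, noise_std=1.0):
    # One shard's gradient estimate: true gradient plus Gaussian noise.
    return true_grad + rng.gauss(0.0, noise_std)


def sync_update(rng, num_shards):
    # Sync: one update uses the average gradient over all shards.
    grads = [noisy_gradient(rng) for _ in range(num_shards)]
    return sum(grads) / num_shards


def async_update(rng):
    # Async: each update uses a single worker's gradient.
    return noisy_gradient(rng)


def update_std(sample_fn, trials=20000):
    # Empirical standard deviation of the gradient used per update.
    return statistics.pstdev([sample_fn() for _ in range(trials)])


rng = random.Random(0)
sync_std = update_std(lambda: sync_update(rng, num_shards=4))
async_std = update_std(lambda: async_update(rng))
print("sync std: %.3f, async std: %.3f" % (sync_std, async_std))
```

Averaging over 4 shards should roughly halve the standard deviation of the per-update gradient, which is the same effect as using a 4x larger batch.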
I think the ps naming vs worker is indeed reversed in T2T, but it seems tedious to change.
@rsepassi
You said "it's equivalent to multiplying your batch size by the number of PS workers you have."
The "number of PS workers" is the number of GPUs allocated to PS, right? Say, if I have two PS processes, each of which has 2 GPUs, then the global batch size is hparams.batch_size * 4. Is that right?
Correct.
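As a quick check of the arithmetic, assuming every ps GPU processes one batch of `hparams.batch_size` per step (the helper name `global_batch_size` and the example value 1024 are just for illustration):

```python
def global_batch_size(batch_size, num_ps_processes, gpus_per_ps):
    # Effective batch size in sync mode: one per-GPU batch for every ps GPU.
    return batch_size * num_ps_processes * gpus_per_ps


# Two ps processes with 2 GPUs each, per-GPU batch size of 1024 (assumed).
effective = global_batch_size(
    batch_size=1024, num_ps_processes=2, gpus_per_ps=2)
print(effective)  # 4096, i.e. batch_size * 4
```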