Tensor2tensor: difference between sync and async distributed training.

Created on 31 Jan 2018 · 6 Comments · Source: tensorflow/tensor2tensor

Hi, I am trying to understand distributed training performance on GPU clusters, and I am a little confused about the following two distributed parallel modes:

1. Sync mode runs the worker jobs on the PS GPU devices and shards the variables across GPU0 of the PS replicas.

2. Async mode shards the variables across the PS services (GPU0 of the PS replicas) but runs the worker jobs on the worker replicas.

Is that right?

What do sync and async mean here? Why are the modes called sync and async, and what is the motivation for each?

Could someone explain?

Thanks.

question

shawnwang18

Most helpful comment

The naming makes things a bit confusing. Let me try to clarify things.

In sync mode, the worker device is actually quite dumb and isn't doing much other than calling Session.run, which launches the actual work to be done on the parameter servers (that's where the confusing naming is - the parameter servers in this case are the "workers" and the worker is actually just a master). Because all work is happening synchronously across the PS workers, noise is reduced - it's equivalent to multiplying your batch size by the number of PS workers you have.

In async mode, each worker runs independently and does the training itself. The parameter servers are used only to hold the parameters. This is noisier because there is no synchronization between the workers.

rsepassi on 9 Feb 2018

👍3
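
To make the batch-size claim above concrete, here is a minimal toy sketch in plain Python. It is not T2T code: the one-parameter model, the data, and the names `grad`, `sync_step`, and `async_steps` are all assumptions made for illustration. It shows that averaging the gradients of equally sized shards gives the same update as one step on the combined batch, while unsynchronized workers each apply an update computed from their own shard only.

```python
# Toy model (hypothetical, not from T2T): fit y = w * x with squared error.

def grad(w, batch):
    """Gradient of 0.5 * (w*x - y)^2, averaged over the examples in batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def sync_step(w, shards, lr):
    """All replicas compute a gradient at the same w; one averaged update."""
    g = sum(grad(w, shard) for shard in shards) / len(shards)
    return w - lr * g

def async_steps(w, shards, lr):
    """Each worker applies its own small-batch update with no coordination.

    Modeled here as one simple interleaving: workers update one after another.
    """
    for shard in shards:
        w = w - lr * grad(w, shard)
    return w

# Two equally sized data shards (made-up numbers).
shards = [[(1.0, 2.0), (2.0, 4.1)], [(3.0, 5.9), (4.0, 8.2)]]
combined = shards[0] + shards[1]

w_sync = sync_step(0.0, shards, lr=0.01)
w_big_batch = 0.0 - 0.01 * grad(0.0, combined)
w_async = async_steps(0.0, shards, lr=0.01)

print("sync step:     ", w_sync)
print("big-batch step:", w_big_batch)
print("async steps:   ", w_async)
```

The sync result matches the big-batch step up to floating-point rounding, which holds only because the shards have equal size. The async result differs because each update uses a single shard's gradient.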

All 6 comments

I observe the same behavior and would like to know the intuition behind this implementation. Also, what work does the worker do in sync mode? Should I assign GPUs to the workers?

WencongXiao on 2 Feb 2018

According to my understanding of the code, the program calls the model function on the "datashard_device".
In single-machine training (and in async mode), the datashard_device is the set of devices assigned to the worker process.
In sync mode, however, the datashard_device is the set of devices assigned to the ps process, so the training actually runs in the ps process.
This is somewhat counterintuitive, since the common practice is to use ps only for model synchronization. I wonder why this design was chosen and what the benefit is.
Thanks.

zhypku on 3 Feb 2018
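
The device choice described in the comment above can be sketched roughly as follows. This is a simplified illustration and not the actual T2T implementation: the function name `datashard_devices`, its parameters, and the `/job:worker` job name are assumptions; only the device-string format follows the usual TensorFlow convention.

```python
def datashard_devices(sync, ps_replicas, ps_gpu, worker_gpu, worker_id=0):
    """Return the devices the model function would be placed on (hypothetical).

    sync mode:  every GPU of every ps replica does the computation.
    async mode: only the GPUs of the calling worker do the computation.
    """
    if sync and ps_replicas > 0:
        return [
            "/job:ps/task:%d/device:GPU:%d" % (task, gpu)
            for task in range(ps_replicas)
            for gpu in range(ps_gpu)
        ]
    return [
        "/job:worker/task:%d/device:GPU:%d" % (worker_id, gpu)
        for gpu in range(worker_gpu)
    ]

# Sync: two ps replicas with two GPUs each give four compute devices.
print(datashard_devices(sync=True, ps_replicas=2, ps_gpu=2, worker_gpu=1))
# Async: worker 3 computes only on its own GPU.
print(datashard_devices(sync=False, ps_replicas=2, ps_gpu=2, worker_gpu=1, worker_id=3))
```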

I think the ps vs. worker naming is indeed reversed in T2T, but it seems tedious to change.

lukaszkaiser on 10 Feb 2018

@rsepassi
You said "it's equivalent to multiplying your batch size by the number of PS workers you have."
The "number of PS workers" is the number of GPUs allocated to PS, right? Say, if I have two PS processes, each of which has 2 GPUs, then the global batch size is hparams.batch_size * 4. Is that right?

zhypku on 11 Feb 2018

Correct.
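
The arithmetic confirmed above can be written as a one-line helper. The function name `global_batch_size` and the example value for `hparams.batch_size` are assumptions chosen for illustration, not part of T2T.

```python
def global_batch_size(batch_size, ps_replicas, gpus_per_ps):
    """Effective batch size in sync mode: one data shard per ps GPU."""
    return batch_size * ps_replicas * gpus_per_ps

# Two ps processes with two GPUs each: hparams.batch_size * 4.
batch_size = 1024  # hypothetical value of hparams.batch_size
print(global_batch_size(batch_size, ps_replicas=2, gpus_per_ps=2))
```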
