Ray: [sgd] A local worker will always be created for PytorchTrainer?

Created on 6 May 2020 · 19 comments · Source: ray-project/ray

Hi,

I'm new to RaySGD and am trying to use PyTorchTrainer for distributed training.
I glanced at the code and see that in all cases there will be a worker on the driver node, with the other workers placed on other Ray nodes. Am I understanding correctly?
So what if I don't have enough resources on the driver?
Also, when creating PyTorchRunner, no resources are specified. Does that mean only 1 CPU will be used for each runner during training? But from what I observe, more than 1 CPU is actually used.
Hope to get some clarification and help on this! Thanks so much!
(I'm using Ray 0.8.4 and torch 1.5.0 btw)

question sgd

Most helpful comment

By doing this, TorchTrainer becomes an actor and would be placed on one node; and thus the local worker would be placed on that node as well. Am I understanding correctly?

Yep!

Okay. I will try this. I'm not familiar with tune and will study this first :) Thanks so much!

Yep, feel free to ask any questions as you do!

All 19 comments

@hkvision thanks for opening up this issue! What do you mean by not having enough resources on the driver? Like the driver doesn't have a GPU?

@hkvision thanks for opening up this issue! What do you mean by not having enough resources on the driver? Like the driver doesn't have a GPU?

Hi @richardliaw, thanks for your reply. Yes. I mean that in YARN settings, for example, the driver is only for scheduling, and the driver hardware is probably not as good as the workers'. So for distributed training, if there is always a process on the driver, won't it become the bottleneck for syncing parameters?

Makes sense.

My suggestion for you is to "lift" the TorchTrainer into an actor:

RemoteTrainer = ray.remote(num_gpus=1)(TorchTrainer)
remote_trainer = RemoteTrainer.remote(data_creator=...)

remote_trainer.train.remote()
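For concreteness, here is a minimal end-to-end sketch of that pattern. The creator functions and the toy regression data are placeholders, and the keyword names (num_replicas, use_gpu, loss_creator, etc.) follow the Ray 0.8-era RaySGD API, so they may differ in other releases:

import ray
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from ray.util.sgd.torch import TorchTrainer

# Placeholder creator functions for a toy regression problem.
def model_creator(config):
    return nn.Linear(1, 1)

def data_creator(config):
    x = torch.randn(64, 1)
    y = 2 * x
    dataset = TensorDataset(x, y)
    return DataLoader(dataset, batch_size=16), DataLoader(dataset, batch_size=16)

def optimizer_creator(model, config):
    return torch.optim.SGD(model.parameters(), lr=0.01)

ray.init(address="auto")  # connect to the running cluster

# Wrapping TorchTrainer with ray.remote schedules the trainer (and therefore
# its local worker) on whichever node satisfies the resource request,
# rather than on the driver.
RemoteTrainer = ray.remote(num_gpus=1)(TorchTrainer)
remote_trainer = RemoteTrainer.remote(
    model_creator=model_creator,
    data_creator=data_creator,
    optimizer_creator=optimizer_creator,
    loss_creator=nn.MSELoss,
    num_replicas=2,
    use_gpu=True,
)

# Actor method calls return object IDs; block on the result with ray.get.
stats = ray.get(remote_trainer.train.remote())
print(stats)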

Another option is simply to use Tune, which lets you keep all of the fault tolerance guarantees and also provides utilities such as auto-checkpointing and TensorBoard logging:

tune.run(TorchTrainer.as_trainable(data_creator=...))
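And a rough sketch of the Tune route, reusing the placeholder creator functions from the sketch above; it assumes the Ray 0.8-era behavior where as_trainable accepts the same constructor keywords as TorchTrainer:

from ray import tune
import torch.nn as nn
from ray.util.sgd.torch import TorchTrainer

TorchTrainable = TorchTrainer.as_trainable(
    model_creator=model_creator,
    data_creator=data_creator,
    optimizer_creator=optimizer_creator,
    loss_creator=nn.MSELoss,
    num_replicas=2,
)

# Tune takes care of checkpointing, TensorBoard logging, and restarting
# failed trials.
analysis = tune.run(
    TorchTrainable,
    stop={"training_iteration": 10},
    checkpoint_freq=2,
)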

Makes sense.

My suggestion for you is to "lift" the TorchTrainer into an actor:

RemoteTrainer = ray.remote(num_gpus=1)(TorchTrainer)
remote_trainer = RemoteTrainer.remote(data_creator=...)

remote_trainer.train.remote()

By doing this, TorchTrainer becomes an actor and would be placed on one node; and thus the local worker would be placed on that node as well. Am I understanding correctly?

Another option is simply to use Tune, which lets you keep all of the fault tolerance guarantees and also provides utilities such as auto-checkpointing and TensorBoard logging:

tune.run(TorchTrainer.as_trainable(data_creator=...))

Okay. I will try this. I'm not familiar with tune and will study this first :) Thanks so much!

By doing this, TorchTrainer becomes an actor and would be placed on one node; and thus the local worker would be placed on that node as well. Am I understanding correctly?

Yep!

Okay. I will try this. I'm not familiar with tune and will study this first :) Thanks so much!

Yep, feel free to ask any questions as you do!

https://github.com/ray-project/ray/blob/master/python/ray/util/sgd/torch/torch_trainer.py#L336
@richardliaw I want to ask why, when creating the Runner actor, it forces the CPU reservation to 1. So if I don't have a GPU, does each worker only get one CPU core? Would this restriction hinder performance?

Also, it seems that if I run two workers locally, the workers actually use more than 1 CPU core, but with multiple nodes, each worker only uses one core... (weird, I will do more testing on this...)
Hope to get some help!

The restriction doesn't hinder performance. It's only for bookkeeping.

Also, it seems that if I run two workers locally, the workers actually use more than 1 CPU core, but with multiple nodes, each worker only uses one core...

Can you tell me more about this?

Also, are you CPU only? Why not try to use GPU?

Also, are you CPU only? Why not try to use GPU?

Yeah, because our team works on running on CPU servers... so we use CPU only for now.
In the CPU-only case, the code forces each worker's reservation to 1 CPU; won't that hinder performance, since each worker would only run on one core?
Thanks!

Any response? @richardliaw Thanks :)

Ah yeah; we don't do CPU isolation in Ray, so 1 CPU doesn't limit the performance of a single worker.

Ah yeah; we don't do CPU isolation in Ray, so 1 CPU doesn't limit the performance of a single worker.

Oh really? Even if the worker only has 1 CPU resource, if I set something like OMP_NUM_THREADS=10, can it still actually use 10 cores?

But one problem is that in the current code, every worker reserves 1 CPU. So it is very likely that all the workers are placed on the same node with 10 CPUs. Then even if each of them uses more than 1 CPU, they would still be competing for limited resources. Any idea on this?

Or can future versions make num_cpus configurable?

@hkvision sorry for the slow reply.

Oh really? Even if the worker only has 1 CPU resource, if I set something like OMP_NUM_THREADS=10, can it still actually use 10 cores?

yes
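For reference, a minimal sketch of one way to do that per worker, assuming the initialization_hook argument exposed by the Ray 0.8-era TorchTrainer (a function run on every worker before training starts); the thread count of 10 and the creator functions (the placeholders from the earlier sketch) are just for illustration:

import os
import torch
import torch.nn as nn
from ray.util.sgd.torch import TorchTrainer

def init_hook():
    # Executed on each remote worker before training, so every worker's
    # PyTorch can use more threads than the single CPU Ray bookkeeps for it.
    os.environ["OMP_NUM_THREADS"] = "10"
    torch.set_num_threads(10)

trainer = TorchTrainer(
    model_creator=model_creator,
    data_creator=data_creator,
    optimizer_creator=optimizer_creator,
    loss_creator=nn.MSELoss,
    num_replicas=2,
    initialization_hook=init_hook,
)
trainer.train()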

But one problem is that in the current code, every worker reserves 1 CPU. So it is very likely that all the workers are placed on the same node with 10 CPUs. Then even if each of them uses more than 1 CPU, they would still be competing for limited resources. Any idea on this? Or can future versions make num_cpus configurable?

Yeah, perhaps we should allow the number of CPUs per worker to be customizable. Would something like

Trainer(num_cpus_per_worker=4)

work for you?

@richardliaw No problem. Thanks so much for your reply. Yeah, it would be great to make the CPU number customizable.
May I ask what your previous assumption was when specifying num_cpus? Since specifying it doesn't actually control the CPU usage of actors, what is num_cpus used for?

Actually @ConeyLiu just pushed a PR that addresses this issue and I'll try to merge it.

The assumption was that the workload would primarily be GPU-intensive, so CPU reservations wouldn't matter so much.

Okay! Many thanks for your help!

Fixed in https://github.com/ray-project/ray/pull/8963! Please let me know if that works for you.
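For later readers, usage with a customizable per-worker CPU reservation would look roughly like the following, reusing the imports and placeholder creator functions from the earlier sketches; num_cpus_per_worker is simply the keyword proposed above, so check the release you are running for the exact name the merged change uses:

trainer = TorchTrainer(
    model_creator=model_creator,
    data_creator=data_creator,
    optimizer_creator=optimizer_creator,
    loss_creator=nn.MSELoss,
    num_replicas=2,
    num_cpus_per_worker=4,  # reserve 4 CPUs per worker so the workers are less likely to be packed onto one node
)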
